FOREWORD


On behalf of the organising committee, I would like to welcome all the participants to the 5th
International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA
2007, held 13-15 December 2007, in Firenze, Italy.

Since 1999, the workshop has been held uninterruptedly every two years, aiming at stimulating
contacts between specialists active in research and industrial developments, in the growing area of voice
signals and image analysis for biomedical applications. The scope of the Workshop includes all aspects of
voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications
and related established and advanced technologies.

The Workshop is unique in its aims and is largely interdisciplinary, addressing voice analysis from
both biomedical and technical perspectives. Participants from the medical, engineering, physics
and mathematical fields are given an interdisciplinary platform for presenting and discussing new
knowledge in this field of research, concerning both adult and children's voices.

This fifth edition of the Workshop has attracted great interest from the international scientific
community, with tens of papers, all of a high scientific level, covering the most relevant fields of research in
voice analysis. Specifically, and according to the aims of the Workshop, papers in the following main
sections are presented:


1. Theoretical models
2. Pathology detection and classification
3. Mechanical models
4. Continuous speech and prosody
5. A-laryngeal speech
6. Newborn infant cry
7. Neurological dysfunction
8. Singing voice
9. Non-human sounds


Moreover, this 5th edition hosts two important sponsored events:


• The Working Groups and Management Committee meetings of COST Action 2103 (President: Philippe
de Jonckere, NL), “Advanced voice quality assessment”, which is one of the actions promoted by
the intergovernmental network for European Cooperation in the field of Scientific and Technical
research.

• The meeting of representatives from the Editorial Board of the Elsevier Journal: Biomedical Signal
Processing and Control, along with a “mini workshop” on the publication process for authors, to bring
useful information and suggestions, especially for young researchers.


Finally, I would like to thank the members of the organising committee and all the reviewers, who
gave freely of their time to assess the highly disparate work of the workshop, helping to improve the
quality of the papers. The Workshop has also benefited from the efforts of the administrative staff within
our University, office for Research and International Relations, and the Department of Electronics and
Telecommunications, who contributed to making this workshop a successful one.

Special thanks to the Fiesole School of Music for their generous participation, and to both the direction
and association of the Ospedale S. Giovanni di Dio, who have allowed the MAVEBA participants to enjoy
the wonderful monumental entrance of the oldest Hospital in Firenze.

Great thankfulness goes to the supporters and sponsors, who confidently gave financial contribution 
to the MAVEBA workshop. 


Dr. Claudia Manfredi 
Conference Chair 
TOWARDS THE SIMULATION OF PATHOLOGICAL VOICE QUALITIES


S. Ben Elhadj Fraj¹, F. Grenez¹, J. Schoentgen¹,²


¹Laboratory of Images, Signals & Telecommunication Devices, Université Libre de Bruxelles,
50, Avenue F. D. Roosevelt, 1050 Brussels, Belgium
²National Fund for Scientific Research, Belgium


Abstract: The presentation concerns a synthesizer for 
disordered voices. The simulation of dysphonia is a 
topic the relevance of which is growing, but to which 
few studies have been devoted. The simulator that is 
discussed here involves a nonlinear memory-less 
model of the glottal area that is driven by a harmonic 
excitation the instantaneous frequency and amplitude 
of which are controlled. The glottal airflow rate is 
generated by means of an aerodynamic model of the 
glottis, which also comprises trachea-source and 
source-tract interactions as well as the generation of 
turbulence noise at the glottis. Trachea and vocal tract 
are modelled by means of a concatenation of lossy 
cylindrical pipes of identical length, but different 
cross-sections. The text concerns the presentation of 
the synthesizer and its synthetic output, as well as the 
results of a perceptual evaluation of the naturalness of 
simulated speech sounds in the framework of a stimuli 
comparison paradigm. 


I. INTRODUCTION 


The presentation concerns a synthesizer for 
pathological voices. Motivations for developing 
simulators of disordered voices are the discovery of 
speech cues that are relevant to the perception of 
abnormal voices; the preparation of reference stimuli in 
the framework of the perceptual assessment of disordered 
voices; the training of speech therapists in the auditory 
evaluation of dysphonic speakers; as well as the testing of 
the reliability or validity of acoustic cues of disordered 
speech. 

Earlier attempts have often involved conventional 
formant synthesizers driven by a concatenated-curve 
model of the glottal excitation, such as the well-known 
Liljencrants-Fant model [11]. Problems with 
concatenated-curve models of the glottal excitation are 
that they are prone to aliasing because their bandwidth is 
unknown a priori and perturbations of the glottal cycle 
lengths have to be synchronized with the cycle onset, 
which is a physiologically unlikely assumption. Also, 
tract-source interaction and generation of additive noise 
rely on ad hoc assumptions. 

We have presented earlier a glottal source model 
based on nonlinear shaping functions [10][14]. This 
model enables directly controlling the bandwidth as well 
as instantaneous frequency of the glottal source signal and 
the perturbations thereof. Source-tract interaction as well
as the generation of additive noise have to be simulated
heuristically, however.

Here, we therefore present a synthesizer that involves 
models of the glottal area and airflow through the glottis. 
Instead of the glottal source signal, the time-evolving 
glottal area is modelled by means of a nonlinear memory- 
less signal model that transforms a trigonometric driving 
function into the desired glottal area waveform. One 
attractive property of the model is that the instantaneous 
frequency and harmonic richness of the glottal area are 
controlled by the instantaneous frequency and amplitude 
of the harmonic driving function [1]. 

The glottal airflow rate is generated by means of an 
aerodynamic model, which includes interactions between 
the glottis and the infra- and supra-glottal ducts [2]. The 
propagation of the acoustic wave through the trachea and 
vocal tract is simulated by means of concatenated tubes. 
Wall vibration, viscous and thermal losses as well as 
acoustic reflection and radiation at the lips and glottis are 
taken into account. 

Random modulation noise such as jitter or tremor and 
abnormal voice qualities such as diplophonia, biphonation 
and irregular vocal cycles are mimicked by means of 
stochastic or deterministic models of the time-evolving 
instantaneous frequency of the driving harmonic of the 
glottal area model [10]. The text focuses on the 
presentation of the model and its synthetic output, as well 
as on preliminary results of a perceptual evaluation of the 
naturalness of simulated vowel categories. 


II. MODELS 
The block diagram of the synthesizer is presented in
Fig. 1. Symbols f, bs, fı and b; designate forward and
backward components of the acoustic pressure wave 
propagating in the infra- and supra-glottal tracts. 
Fig. 1: Block diagram of the synthesizer. Blocks: cosinusoid; lung pressure; subglottal tract;
glottal area template; Fourier coefficients; polynomial coefficients; glottal area shaping
function model; aerodynamic model of the glottal airflow; speech signal.

Models and analysis of vocal emissions for biomedical applications: 5th international workshop: 


December 13-15, 2007: Firenze, Italy, ed. by C. Manfredi, 
ISBN 978 88-8453-673-3 (print) ISBN 978-88-8453-674-7 (online) 
© Firenze university press, 2007. 


A. Nonlinear memory-less glottal area model 


A reason for opting for a shaping function model of the 
glottal area is that such a model gives explicit control over 
the instantaneous frequency of the glottal area and over its 
shape, which may evolve smoothly from a constant to the 
template area via a quasi-sinusoid. 

The glottal area is assumed to evolve symmetrically
during glottal opening and closing. The open glottis
template is therefore a half-cycle cosinusoid positioned
symmetrically about the time origin. The half-cycle is
padded to the left and right with zeros to model glottal
closure. Combined zero and hemi-cosinusoidal curves
form the glottal area template, which is an even time
function. The greyish blocks in Fig. 1 summarise
operations that are carried out once. They obtain the
shaping polynomial on the basis of the glottal area
template. The polynomial coefficients are calculated
from the Fourier series coefficients of the
template by means of a constant linear transform [1].

The polynomial shaping function per se forms the 
glottal area model, which outputs the template exactly 
when the model is driven by a cosine the amplitude of 
which is equal to unity and the period of which is equal to 
the length of the area template. 
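
The link between the Fourier coefficients of the template and the shaping polynomial can be illustrated numerically. The sketch below assumes that the constant linear transform of [1] is equivalent to the Chebyshev identity T_k(cos θ) = cos(kθ), so that an even template with cosine-series coefficients a_k is reproduced by F(x) = Σ a_k T_k(x) when driven by x = cos θ. The open quotient, the number of harmonics and all function names are illustrative choices, not values taken from the text.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def glottal_area_template(theta, open_quotient=0.6):
    """Even template: half-cycle cosinusoid centred on 0, zero elsewhere."""
    half_width = np.pi * open_quotient            # glottis open for |theta| < half_width
    area = np.cos(0.5 * np.pi * theta / half_width)
    return np.where(np.abs(theta) < half_width, area, 0.0)

def cosine_series_coefficients(template, n_harmonics=40, n_points=4096):
    """Numerical cosine-series coefficients a_0..a_K of an even periodic template."""
    theta = np.linspace(-np.pi, np.pi, n_points, endpoint=False)
    g = template(theta)
    a = np.array([2.0 * np.mean(g * np.cos(k * theta))
                  for k in range(n_harmonics + 1)])
    a[0] *= 0.5                                   # the constant term is the plain mean
    return a

def shaping_function(x, a):
    """F(x) = sum_k a_k T_k(x), evaluated in the Chebyshev basis."""
    return cheb.chebval(x, a)

a = cosine_series_coefficients(glottal_area_template)
theta = np.linspace(-np.pi, np.pi, 1000)
area_unit_drive = shaping_function(np.cos(theta), a)        # amplitude 1: the template
area_weak_drive = shaping_function(0.5 * np.cos(theta), a)  # amplitude 0.5: another shape
```

With a unit-amplitude cosine the output is the truncated Fourier series of the template; a smaller driving amplitude yields a different, smoother area waveform. For low orders, `cheb.cheb2poly(a)` converts the same coefficients to ordinary power-series coefficients, which is itself a constant linear transform, but the power basis becomes ill-conditioned at high order.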

The instantaneous length of an area cycle and its 
spectral slope and amplitude are controlled via the 
instantaneous frequency and amplitude of the driving 
cosine to output areas that differ from the template. 


Fig.2: Glottal area (above) and glottal airflow 
waveform (below) 
B. Aerodynamic model of the glottis and interactive
source-filter coupling


Assuming continuity at the glottal boundaries, Titze 
has derived an algebraic equation for the dependence of 
the airflow rate on glottal area, incident components of 
the infra- and supra-glottal acoustic pressure waves, 
sound speed and density of air [2]. Both epi-laryngeal and
sub-glottal pressures, which are the pressures downstream 
and upstream of the glottis, are expressed as the sum of 
forward and backward propagating components, which 
are obtained by a temporal simulation of the wave 
propagation in the sub-glottal and supra-glottal tracts [2]. 

In Fig.2, one sees the area outputted by the polynomial 
shaping function model and the waveform outputted by 
Titze’s glottal flow model. One notices oscillatory ripples 
in the glottal flow waveform owing to source-tract 
interaction. 


C. Trachea and vocal tract models 


C.1. Lossless model 

In agreement with the Kelly-Lochbaum model of 
wave propagation, trachea and vocal tract are mimicked 
by means of a concatenation of cylindrical pipes of 
identical lengths, but different cross-sections. In the 
lossless case, the reflection coefficient at the lips equals 
-1 and the reflection coefficient at the glottis +1. That is, 
no acoustic energy is transmitted to the outside. 
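
The two limiting values quoted above follow from the usual junction formula for pressure waves. The sketch below assumes the sign convention r = (A_in - A_out) / (A_in + A_out), which is one common choice for Kelly-Lochbaum models and is not stated explicitly in the text; the function name is illustrative.

```python
def reflection_coefficient(area_in, area_out):
    """Pressure-wave reflection coefficient for a wave travelling from a pipe of
    cross-section area_in into a pipe of cross-section area_out (assumed convention)."""
    return (area_in - area_out) / (area_in + area_out)

lips = reflection_coefficient(1.0, 1e12)     # pipe opening into a very large area
glottis = reflection_coefficient(1.0, 0.0)   # pipe terminated by a closed end
matched = reflection_coefficient(2.0, 2.0)   # equal cross-sections: no reflection
```

Under this convention the open lip end tends to -1 and the closed glottal end gives +1, in agreement with the lossless case described above.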

The wave propagation can be simulated digitally if the 
time step between samples is chosen to be the time 
interval the acoustic wave takes to travel the length of 
one pipe. Here, the sampling frequency equals 88.2 kHz. 


C.2. Lossy model 

Several models have been proposed to simulate losses 
in the framework of wave propagation models. To 
simulate wall vibration losses, an auxiliary (transversal) 
tube is inserted at each junction between adjacent pipes 
[3]. Acoustic losses due to friction at the tract walls and 
heat losses through the walls are modelled according to 
[7]. 

To mimic acoustic reflection and emission at the 
glottis, proposals by Flanagan [4] or Badin & Fant [5] 
have been implemented. Both models give results that are 
auditorily equivalent. 

Several lip radiation models have been investigated, 
each model using a reflection coefficient that depends on 
frequency [4][6]. As an alternative, a conical tubelet, the 
opening of which is controlled, has been connected at the 
lip end of the vocal tract to simulate the transition from 1- 
dimensional to 3-dimensional wave propagation. 
Informal listening tests have suggested retaining the lip 
radiation model proposed by Flanagan [4]. 

Tracheal losses are taken into account via a real
attenuation coefficient at the lung end. The numerical
value of this coefficient has been investigated on the basis
of perceptual experiments that are reported hereafter.


C.3. Vowel area functions 

Six French vowel categories, [a], [i], [u], [o], [e] and
[ɛ], have been synthesized. The pipe cross-sections have
been fixed on the basis of published data. The area
functions of [a] and [i] have been recovered from [12]
and the area functions of [u], [o], [e] and [ɛ] from [13].
The latter have been interpolated to increase the spatial
resolution from 1 cm to 0.396825 cm.

The length of the tracheal tube is equal to
14.2857 cm and its cross-section is equal to 1.2 cm².
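
The spatial resolution and the tracheal length quoted above are consistent with the rule that one pipe section is traversed in one sampling interval. The check below assumes a sound speed of 350 m/s, which is not stated in the text but reproduces the quoted figures at a sampling frequency of 88.2 kHz; the names are illustrative.

```python
SOUND_SPEED_CM_PER_S = 35000.0   # assumed value, not given in the text

def section_length_cm(sampling_frequency_hz, sound_speed=SOUND_SPEED_CM_PER_S):
    """Length of one pipe section: distance travelled by sound in one sample."""
    return sound_speed / sampling_frequency_hz

section = section_length_cm(88200.0)     # about 0.396825 cm
trachea_sections = 14.2857 / section     # about 36 sections
```

Under this assumption the tracheal tube corresponds to 36 sections of equal length.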


D. Models of vocal perturbations 


Additive noise owing to turbulence is mimicked by 
means of a model proposed by Titze [2]. Vocal jitter and 
frequency tremor are simulated via diffusion models of 
the phase of the cosine that is driving the glottal area 
model [8]. Vocal amplitude shimmer and amplitude 
tremor arise passively in the glottal and tract models via 
modulation distortion [9]. Deterministically varying 
glottal cycle lengths are used to simulate diplophonia and 
biphonation and stochastically fluctuating glottal cycle 
lengths are used to simulate random cycles [10]. 
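
As an illustration of how a phase diffusion can perturb the cycle lengths of the driving cosine, the sketch below adds a random walk to the phase. The diffusion models of [8] are not specified here, so the plain Gaussian random walk, the parameter names and the numerical values are assumptions made for illustration only.

```python
import numpy as np

def driving_cosine(f0_hz, fs_hz, duration_s, phase_noise_std=0.0, seed=0):
    """Cosine whose phase advances by a constant increment plus Gaussian noise
    (a discrete random walk, assumed here as a stand-in for a diffusion model)."""
    rng = np.random.default_rng(seed)
    n = int(round(duration_s * fs_hz))
    increments = 2.0 * np.pi * f0_hz / fs_hz + phase_noise_std * rng.standard_normal(n)
    phase = np.cumsum(increments) - increments[0]   # phase starts at zero
    return np.cos(phase), phase

clean, clean_phase = driving_cosine(100.0, 88200.0, 0.01)
jittered, jittered_phase = driving_cosine(100.0, 88200.0, 0.01, phase_noise_std=1e-3)
```

With `phase_noise_std=0` the output is an unperturbed cosine; a non-zero value makes the instantaneous frequency, and hence the glottal cycle lengths, fluctuate from sample to sample.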


III. AUDITORY EVALUATION
A. Vowel category identification
The first experiment concerns vowel category
identification. The objective is to test whether human
listeners are able to identify the six synthetic target vowel
categories [a], [i], [u], [o], [e] and [ɛ].


Tab.1: Auditory vowel category identification in percent


identified as:
target | [a] | [i] | [u] | [o] | [e] | [ɛ] | other
[a] | 92 | 0 | 0 | … | … | … | 7
[i] | 0 | 93 | … | … | … | … | …
[u] | 0 | 0 | 81 | 8 | 0 | 0 | 11
[o] | 0 | 0 | 7 | 78 | 0 | 0 | 15
[e] | 0 | 0 | … | … | 75 | 7 | 17
[ɛ] | 0 | 0 | 0 | 0 | 25 | 53 | 22

The open quotient of the glottis and the lung
reflection coefficient are among the parameters that
influence the perceived naturalness of vowel timbre.
Therefore, each vowel category has been synthesized
with glottal open quotients equal to 50, 62 or 83 % and
lung reflection coefficients equal to 0.2, 0.5 or 0.8, giving
a total of nine vocal timbres per category. The source
fundamental frequency has been equal to 100 Hz. The
glottal cycles have been perturbation-free. The length of
each synthetic stimulus has been 1 second.


Eight French-speaking judges have listened to the 54 
realizations in an arbitrary order. They have been asked 
to recognize and identify the vowel category by ticking 
one item in a list of 11 different monosyllabic French 
words. Each word has been [pV] or [pVR], with V an oral 
French vowel. The word list has been completed by an 
“indefinite” bin. 


B. Auditory evaluation of voice source timbres 


The objective of the second experiment has been to 
determine the preferred voice timbre within each vowel 
category. The experiment has been carried out within the 
framework of a stimuli comparison paradigm. The pair- 
wise comparison paradigm has the advantage that 
untrained judges are able to rank voice timbres without 
having to assign explicitly scores to stimuli according to 
perceived quality dimensions (e.g. naturalness, clarity, 
brilliance, etc.), which are difficult to define and on which 
even professional judges may not agree. 


Tab.2: Vocal timbre average ranks per vowel category


Pulmonary reflection coefficient - glottal open quotient (%) | [a] | [i] | [u] | [o] | [e] | [ɛ]
0.2 - 50 | 6.2 | 5.1 | 1.9 | 7.0 | 5.3 | 4.4
0.5 - 50 | 6.4 | 4.3 | 2.6 | 5.8 | 5.8 | 3.8
0.8 - 50 | 5.6 | 3.7 | 1.9 | 5.8 | 5.8 | 2.8
0.2 - 62 | 4.9 | 5.7 | 3.2 | 4.5 | 4.4 | 3.6
0.5 - 62 | 5.0 | 4.9 | 4.9 | 3.3 | 4.9 | 3.8
0.8 - 62 | 4.5 | 5.7 | 6.8 | 1.1 | 4.9 | 5.5
0.2 - 83 | 1.3 | 3.4 | 6.3 | 2.1 | 0.6 | 4.0
0.5 - 83 | 1.1 | 2.6 | 4.5 | 3.0 | 1.9 | 4.4
0.8 - 83 | 1.1 | 0.6 | 3.8 | 3.5 | 2.3 | 3.6


The 9 timbres for each vowel category have been 
presented pair-wise, that is a total of 9 x 8/2=36 pairs per 
category. The judges have been informed of the target 
category by means of a monosyllabic French word 
comprising the target as a nucleus. 

The judges have been asked to indicate their preferred 
timbre within each pair by clicking on a button that is part 
of a user interface. The listeners can also select an “equal 
preference” button when they consider that the voice 
qualities of both stimuli are equivalent. They have the 
opportunity to listen to each member of a pair as often as 
they wish. 

For each pair, the software that handles stimuli
presentation assigns the score 1 to the preferred stimulus
and the score 0 to the other. When both are considered
equal, each is assigned the score 0.5. Once all 36
comparisons have been carried out, the stimuli are ranked
according to their scores. An average rank of 8 would
mean that all the judges have preferred this timbre every
time it has been presented. An average rank of 4 would
mean that this timbre has been preferred as often as it has
been disfavoured and a rank of 0 would mean that the
listeners have always preferred the other stimulus in the
pair.
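
The scoring rule can be written down in a few lines. The sketch below tallies the scores of one judge for one vowel category; the `judgement` callback and the function name are hypothetical stand-ins for the listener's responses collected by the presentation software.

```python
from itertools import combinations

def preference_scores(n_stimuli, judgement):
    """Tally pair-wise preferences. judgement(i, j) returns the index of the
    preferred stimulus, or None when both are judged equivalent."""
    scores = [0.0] * n_stimuli
    for i, j in combinations(range(n_stimuli), 2):
        winner = judgement(i, j)
        if winner is None:
            scores[i] += 0.5
            scores[j] += 0.5
        else:
            scores[winner] += 1.0
    return scores

n_pairs = len(list(combinations(range(9), 2)))               # 9 x 8 / 2 pairs
always_lower = preference_scores(9, lambda i, j: min(i, j))  # strict ordering
all_equal = preference_scores(9, lambda i, j: None)          # no preference at all
```

A stimulus that wins all eight of its comparisons scores 8, one that is always judged equivalent scores 4, and one that always loses scores 0; averaging these scores over the judges gives the average ranks discussed above.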


IV. RESULTS AND DISCUSSION 


Table 1 shows the percentage of identification of the 
synthetic target categories with the reference categories 
(11 French vowels & one “indefinite” bin). Because of 
lack of space, only identifications with reference 
categories are reported that also are target categories. 
Other misidentifications are pooled under the heading 
“other”. One sees that confusions exist between vowels
that differ in the degree of aperture, e.g. vowel [a] has
been identified once as [ɛ] and vowel [i] once as [e].
Confusions between [o] and [u] appear to be balanced,
that is, roughly as many [o] are identified as [u] as [u]
are identified as [o]. The pair [e][ɛ] is unbalanced,
however: category [ɛ] has been identified far more often
as [e] than [e] as [ɛ]. Given that the vowel sounds are
sustained and presented in isolation, misidentifications
with neighboring categories of similar aperture are
expected.

The right-most column in Table 1 reports the 
percentage of identifications of the target categories with 
the remaining reference categories. The total percentage 
of misidentification suggests that extreme vowels [a], [i]
and [u] are less likely to be misidentified than vowels [o],
[e] and [ɛ], which is plausible because the former have
fewer neighbours with which they may be confused.

Table 2 reports the ranked preference of within- 
category timbres for the vowel categories that have been 
reported in Table 1. A principal component analysis has 
been carried out on the vowel categories. The results 
show that two principal components explain 89% of the 
total variance after rotation. Categories [a], [i] and [e] are 
strongly correlated (> 0.5) with the first component. 
Categories [u] and [ɛ] are strongly correlated with the
second component and category [o] is strongly negatively
correlated with the second component.

Category [o] is therefore unique. Inspection of Table 
2 shows that [o] is the only category that is strongly 
preferred when the voice timbre is characterized by an 
open quotient of 50% and disfavored when characterized 
by an open quotient of 62%. Categories [u] and [ɛ], on the
contrary, are strongly preferred when the open quotient
equals 62% and disfavored when it equals 50%. Principal
component 2 therefore captures the antagonistic behavior
of vowel categories [u], [o] and [ɛ] with regard to the
open quotients of 50% and 62%.

Timbres of categories [a], [i] and [e] are strongly or 
moderately preferred when the open quotients equal 50% 
or 62% respectively. 

Principal components analysis therefore shows that 
category [o] behaves differently from all the other 
categories for which an open quotient of 62% is either 
strongly or moderately preferred. 

The analysis also shows that the pulmonary reflection 
coefficient has no major influence on listener preference. 
Auditory tests confirm that a higher reflection coefficient
gives rise to timbres that are perceived as more brilliant.
A MATHEMATICAL MODEL FOR ACCURATE MEASUREMENT OF
Abstract: Jitter is a fundamental metric of voice
quality. The majority of jitter estimators produce an
average value over a duration of several pitch periods.
This paper proposes a method for short-time jitter
measurement, based on a mathematical model which
describes the coupling of two periodic phenomena. The
movement of one of the two periodic phenomena with
respect to the other is what is considered as jitter and
what the proposed method measures. Through tests
with synthetic jitter signals it has been verified that
the suggested method provides accurate local estimates
of jitter. Further evaluation was conducted on actual
normal and pathological voice signals from the
Massachusetts Eye and Ear Infirmary (MEEI) Disordered
Voice Database. Compared with corresponding
parameters from the Multi-Dimensional Voice Program
(MDVP) and the Praat system, the proposed method
outperformed both in normal vs. pathological voice
discrimination.

Keywords: Jitter, short-time, pathological voice. 


I. INTRODUCTION 


Evaluation of voice quality is an essential diagnostic aid
for the assessment of pathological voice. Methods based
on acoustic analysis have several advantages. In comparison
to methods such as videoendoscopy or electroglottography
(EGG), they cost less, require less time and are
non-invasive for the patient. Furthermore, acoustic analysis
can produce automatic quantitative results, which, apart
from assisting clinicians, can be exploited for the unsupervised
classification of a voice as pathological or normal,
or even for the detection of specific cases of dysphonia.

The main effect of a pathological condition, as we perceive
it, is noise. The parameters produced by acoustic
analysis of voice quality usually quantify the presence of
this aperiodic component: mainly additive noise, such as
in cases of breathiness, or modulation noise, such as in
cases of roughness. Modulation noise can be detected either
in frequency, where it is called jitter, or in amplitude,
where it is called shimmer. Jitter is defined as perturbations
of the glottal source signal that occur during vowel
phonation and affect the glottal pitch period. The measurement
of jitter can be performed by using the radiated
speech signal, or by using measurements of glottal conductivity
through EGG. The computation may take place
in the time domain, in the frequency domain (magnitude
spectrum), or using the cepstrum.

Several methods have been proposed for the computation
of quantitative values for jitter. Time domain methods
are usually based on pitch period measurements that are
used to estimate an average value of jitter over several
periods. If N is the total number of pitch periods
and u(n) is the pitch period sequence, the definitions
of widely accepted jitter measurements are given below.
Local jitter is the period-to-period variability of pitch (%)

$$\frac{\dfrac{1}{N-1}\sum_{n=1}^{N-1}\left|u(n+1)-u(n)\right|}{\dfrac{1}{N}\sum_{n=1}^{N}u(n)} \tag{1}$$

Absolute jitter is the period-to-period variability of pitch
in time

$$\frac{1}{N-1}\sum_{n=1}^{N-1}\left|u(n+1)-u(n)\right| \tag{2}$$

Relative Average Perturbation (RAP) jitter provides the
variability of pitch with a smoothing factor of 3 periods
(%)

$$\frac{\dfrac{1}{N-2}\sum_{n=1}^{N-2}\dfrac{\left|2u(n+1)-u(n)-u(n+2)\right|}{3}}{\dfrac{1}{N}\sum_{n=1}^{N}u(n)} \tag{3}$$

Pitch Period Perturbation Quotient (PPQ) provides the
variability of pitch with a smoothing factor of 5 periods
(%)

$$\frac{\dfrac{1}{N-4}\sum_{n=1}^{N-4}\dfrac{\left|4u(n+2)-u(n)-u(n+1)-u(n+3)-u(n+4)\right|}{5}}{\dfrac{1}{N}\sum_{n=1}^{N}u(n)} \tag{4}$$
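
The four definitions (1)-(4) translate directly into array operations. The following is a minimal sketch, assuming the pitch period sequence u(n) is already available as an array; the function name and the factor 100 used to express the relative measures in percent are choices made here.

```python
import numpy as np

def jitter_measures(periods):
    """Average jitter measures of a pitch period sequence u(n), per (1)-(4)."""
    u = np.asarray(periods, dtype=float)
    mean_period = u.mean()
    absolute = np.mean(np.abs(np.diff(u)))                          # (2), in units of u
    local = 100.0 * absolute / mean_period                          # (1), percent
    rap_terms = np.abs(2 * u[1:-1] - u[:-2] - u[2:]) / 3            # 3-period smoothing
    ppq_terms = np.abs(4 * u[2:-2] - u[:-4] - u[1:-3] - u[3:-1] - u[4:]) / 5
    return {
        "local": local,
        "absolute": absolute,
        "rap": 100.0 * rap_terms.mean() / mean_period,              # (3), percent
        "ppq": 100.0 * ppq_terms.mean() / mean_period,              # (4), percent
    }

steady = jitter_measures([100] * 10)              # perfectly periodic sequence
alternating = jitter_measures([95, 105] * 5)      # periods alternating around 100
```

For the alternating sequence the absolute jitter is 10 (in the units of the periods), the local jitter 10 %, RAP about 6.67 % and PPQ 4 %; all measures vanish for the steady sequence.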


The pitfall with such techniques is that they rely heavily on
a periodicity that does not actually exist in speech, while
some methods specifically provide a jitter value that is a
percentage of that notion of periodicity. In order to overcome
this problem (the existence of non-periodicity), a
standard solution is to perform low-pass filtering before
pitch estimation. This essentially destroys the
details of the speech signal: it reduces the effect of
non-periodicity, which is, however, what we would like to
measure.


An alternative to calculating an average value for jitter
is short-time tracking. A sequence of jitter
values over small intervals can be more precise, since it
does not assume long-term periodicity, and may even provide
better insight into the evolution of pathological voices. In this
work we suggest the use of a mathematical model that
combines two periodic phenomena in order
to produce local aperiodicity. Based on that, we
identify jitter as the movement of one of the two periodic
phenomena with respect to the other. This movement
is exactly what we try to measure. Using such a
model we are able to calculate the value of short-time jitter
with high precision. Comparison was made with the corresponding
jitter measurements provided by Praat [1] and the
Multi-Dimensional Voice Program (MDVP) [2] of Kay-Pentax,
on the Massachusetts Eye and Ear Infirmary
(MEEI) Disordered Voice Database [3].


The paper is organized as follows. In section II 
we present the mathematical model we propose and the 
method we derived from it to measure short-time values 
of jitter. The conducted experiments and their results are 
presented in section III. Section IV concludes the paper. 


Figure 1: Glottal impulse train of the proposed jitter model (amplitude versus time in
samples; impulses at 0, P-ε, 2P, 3P-ε and 4P, separated alternately by P-ε and P+ε).

II. METHOD 


Jitter may be expressed as a perturbation on the glottal
excitation impulse train. A simple mathematical model
can be obtained by considering a cyclic perturbation, with
pitch deviation of a constant value, applied every second
impulse [4]. The glottal impulse train can then be expressed
as

$$p[n] = \sum_{k=-\infty}^{+\infty}\delta[n-(2k)P] + \sum_{k=-\infty}^{+\infty}\delta[n+\epsilon-(2k+1)P] \tag{5}$$

where P is the pitch period and ε is the pitch deviation,
both in samples. This model, shown in Fig. 1, realizes
the combination of two periodic phenomena, and ε is the
movement that corresponds to the local aperiodicity of jitter
and therefore the value we should seek to measure. The
value of ε can range from 0 (no jitter) to P (pitch halving).

Figure 2: Power spectrum of the harmonic and subharmonic
parts. It is worth noting that the crossings between the
two parts reveal the value of jitter.

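The power spectrum of the impulse train can be shown
to be

$$
\begin{aligned}
|P(\omega)|^2 &= 2\left(1+\cos[(\epsilon-P)\omega]\right)\sum_{k=-\infty}^{+\infty}\frac{\pi^2}{P^2}\,\delta\!\left(\omega-k\frac{2\pi}{2P}\right) \\
&= 2\left(1+\cos[(\epsilon-P)\omega]\right)\sum_{l=-\infty,\,k=2l}^{+\infty}\frac{\pi^2}{P^2}\,\delta\!\left(\omega-l\frac{2\pi}{P}\right) \\
&\quad + 2\left(1+\cos[(\epsilon-P)\omega]\right)\sum_{l=-\infty,\,k=2l+1}^{+\infty}\frac{\pi^2}{P^2}\,\delta\!\left(\omega-l\frac{2\pi}{P}-\frac{\pi}{P}\right)
\end{aligned}
$$

The last part can be written as

$$|P(\omega)|^2 = H(\epsilon,\omega) + S(\epsilon,\omega) \tag{6}$$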


where H(ε, ω) is the harmonic part of the power spectrum,
as influenced by jitter, while S(ε, ω) is the subharmonic part
that appears because of the jitter.

The two power spectra for various values of ε are depicted
in Fig. 2. We observe that, for a given value of ε, the
harmonic and subharmonic parts cross over exactly that many
times. The structure remains the same at the output
of a linear system when the input is the impulse train
p[n].
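
This property can be checked numerically on the model itself. The sketch below builds the impulse train over a whole number of double periods, reads the spectral lines at multiples of π/P from the DFT, interpolates the harmonic (even) and subharmonic (odd) lines linearly and counts the sign changes of their difference. The linear interpolation, the grid density and the function name are choices made for this illustration, not details taken from the text.

```python
import numpy as np

def count_crossings(P, eps, n_double_periods=8):
    """Count crossings between harmonic and subharmonic spectral envelopes of the
    impulse train with pitch period P and pitch deviation eps (both in samples)."""
    M = n_double_periods
    x = np.zeros(2 * P * M)
    x[0::2 * P] = 1.0                  # impulses at 2kP
    x[P - eps::2 * P] = 1.0            # impulses at (2k+1)P - eps
    power = np.abs(np.fft.rfft(x)) ** 2
    k = np.arange(P + 1)               # spectral lines at omega = k * pi / P
    lines = power[k * M]
    harmonic_k, harmonic = k[0::2], lines[0::2]
    subharmonic_k, subharmonic = k[1::2], lines[1::2]
    grid = np.linspace(1.0, P - 1.0, 4001)
    diff = (np.interp(grid, harmonic_k, harmonic)
            - np.interp(grid, subharmonic_k, subharmonic))
    sign = np.sign(diff)
    sign = sign[sign != 0]             # ignore exact ties on the grid
    return int(np.count_nonzero(sign[1:] != sign[:-1]))

crossings = [count_crossings(100, eps) for eps in (0, 3, 5)]
```

For P = 100 samples the count equals the pitch deviation in each of the three cases, including zero crossings when there is no jitter.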

Based on this structure of the power spectra, a
short-time jitter estimator has been developed. Initially,
for a given speech signal, pitch estimation provides
a temporal sequence of pitch periods.
A sliding frame is used to examine the
signal gradually in time. The size of the frame can be
either fixed to 4 times the average pitch period, or variable
and equal to 4 times the local pitch period. The frame step
is accordingly either one average pitch period or one local
pitch period. A Hanning window is then applied to
the frame and the power spectrum is computed. The size
of the Discrete Fourier Transform is that of the smallest
power of 2 that is closest to the length of the frame.
From the power spectrum, the harmonic and subharmonic
parts are extracted, and the jitter value of the current
frame is estimated by counting the number
of crossings between them. In order to overcome potential
spectral resolution problems, a threshold is used to determine
whether a crossing has occurred: if, after a candidate
crossing, the harmonic and subharmonic parts never differ
by more than the threshold value before the next potential
crossing, the candidate is not counted. Through testing,
the threshold value has been set to 3 dB. In the end, a
short-time jitter sequence with integer values (i.e. in samples)
is obtained. Taking into account the sampling frequency
of the signal, the sequence is converted to μs. Evidently,
the higher the sampling frequency, the finer the
resolution of the measurement.
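
The dependence of the resolution on the sampling frequency follows from the unit conversion. A short sketch, with an illustrative function name:

```python
def jitter_samples_to_microseconds(jitter_samples, sampling_frequency_hz):
    """Convert a jitter value counted in samples to microseconds."""
    return 1e6 * jitter_samples / sampling_frequency_hz

step_16k = jitter_samples_to_microseconds(1, 16000)   # one sample at 16 kHz
step_48k = jitter_samples_to_microseconds(1, 48000)   # one sample at 48 kHz
```

One sample corresponds to 62.5 μs at 16 kHz and to about 20.8 μs at 48 kHz, which is the smallest jitter step the estimator can report at each rate.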


III. EXPERIMENTS AND RESULTS


In order to verify the validity of the proposed method, in 
theory and in practice, experiments were carried out with 
both synthetic and actual pathological voice signals. The 
actual signals were taken from the MEEI Disordered Voice 
Database [3]. 


A. Synthetic Signals 


The synthetic signals were created using glottal impulse
trains as described in (5). These were used to excite an
AR model of order 50, extracted from a sustained recording
of vowel /a/, with an average fundamental frequency
of 125 Hz. This was done for sampling frequencies of 16
and 48 kHz, and for ε values from 0 to 10% of each pitch
period. The duration of the signals was set to 1 s.
Using a fixed frame size, with knowledge of the actual
pitch period, we confirmed our observations. The structure
of the glottal excitation was maintained in the final
signal and exact measurement of the short-time jitter was
possible. Fig. 3 shows the power spectrum of a frame
of the synthetic signal, with sampling frequency 48 kHz
and ε = 5. The crossings counted correspond to the jitter
movement, while two false crossings are correctly rejected.

Figure 3: In the experiments with synthetic signals, the
proposed method measures exactly the local jitter value.

To verify our results, we used the Praat [1] system as a
reference. The absolute jitter (2) measurement as it
is implemented in Praat [Jitter (local, absolute)] was used.
Since our method calculates a sequence of short-time values,
we used the average jitter value for comparison. With
Fig. 1 in mind, absolute jitter (2) would return a jitter
value of 2ε, while the average value we measure is ε.
For a like-for-like comparison we therefore use double the
average jitter value.
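
The factor of two can be verified on the model directly: the impulses lie at 0, P-ε, 2P, 3P-ε, and so on, so consecutive periods alternate between P-ε and P+ε. The sketch below uses illustrative values P = 100 and ε = 5 and an illustrative function name.

```python
import numpy as np

def model_impulse_positions(P, eps, n_impulses):
    """Positions of the impulses of the jitter model: every second impulse
    is advanced by eps samples."""
    k = np.arange(n_impulses)
    return k * P - eps * (k % 2)

periods = np.diff(model_impulse_positions(100, 5, 21))   # 95, 105, 95, ...
absolute_jitter = np.mean(np.abs(np.diff(periods)))      # definition (2)
```

Every period-to-period difference has magnitude 2ε, so definition (2) returns 10 samples for ε = 5.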

The difference between the actual jitter value and
the results of our method and of Praat is presented in Fig. 4,
for 16 and 48kHz. The proposed method has zero error,
while the error of Praat is of the order of some
μseconds for all ε values, except three cases at 48kHz,
where Praat classified the signals as unvoiced and did not
return jitter measurements.


B. MEEI Disordered Voice Database 


The MEEI Disordered Voice Database contains sustained 
vowel and reading text samples, from 53 subjects with nor- 
mal voice and 657 subjects with a wide variety of patho- 
logical conditions. Also included for most of the signals 
were the acoustic analysis parameters produced by the 
Multi-Dimensional Voice Program (MDVP) [2]. For the 
purpose of our experiments, all 53 of the normal voice 
samples and 632 of the pathological voice samples were 
used, and specifically the sustained recordings of vowel 
/a/. The excluded pathological voice samples were the
ones that lacked the MDVP parameters. The sampling
frequency of the selected signals was originally either 25
or 50kHz, with the normal voice ones only at 50kHz. To
avoid potential correlation of the results with sampling
frequency, all signals used in this paper were resampled to
25kHz.

Figure 4: The minimal error difference of Praat verifies
the results of the proposed method, which has zero error.

For the pitch estimation required by the proposed 
method, YIN [5] was used with default parameters. Both 
fixed and variable frame size experiments took place. The 
computed short-term jitter sequence, for each sample, was 
averaged and doubled, to provide a single jitter measure- 
ment. The MDVP Jita parameter and the Praat Jitter (local, 
absolute) value, both implementations of (2), were used 
for comparison. Receiver Operating Characteristic (ROC)
curves for the four measurements under comparison are shown
in Fig. 5. The proposed method discriminates between
pathological and normal samples better than both MDVP
and Praat, with the fixed frame case giving slightly
better results than the variable frame case.
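
As a minimal sketch of how such a ROC comparison can be computed, the following code sweeps a threshold over one jitter measure and integrates the resulting curve. The jitter values and labels are hypothetical and are not taken from the MEEI database; the function names are illustrative.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs for every threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a ROC curve given as ordered (fpr, tpr) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical jitter values in microseconds (1 = pathological, 0 = normal).
jitter = [20.0, 45.0, 30.0, 80.0, 60.0, 25.0, 90.0, 40.0]
is_pathological = [0, 0, 0, 1, 1, 0, 1, 1]
curve = roc_points(jitter, is_pathological)
auc = area_under_curve(curve)
```

A larger area under the curve indicates a more discriminant measure, so two jitter estimators can be ranked by running the same code on each set of values.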


IV. CONCLUSION 


We proposed a method for short-time jitter evaluation, 
based on a mathematical model of two periodic phenom- 
ena. The experiments conducted with synthetic signals 
verified that the method produces accurate local estimates 
of jitter. Regarding pathological voice classification, it 
was shown that the average value of the proposed method 
is more discriminant than two standard implementations 
of absolute jitter (2) measurement, namely MDVP and 
Praat. 

The proposed method lets us follow the behavior of local
jitter over time, which we plan to examine in depth.
Knowledge of the gradual development of jitter, apart from
being of use in voice quality evaluation, may also be
useful in automatic discovery of pathological conditions.

Jitter also contributes to the appearance of noise in the
spectrum. This presents problems regarding the computation
of a Harmonics to Noise Ratio (HNR) estimate.
Identifying in the magnitude spectrum the noise induced
by jitter may provide a more accurate HNR value [6].
The points where the crossings between the harmonic and
subharmonic parts of the power spectrum occur are good
candidates for deciding which parts of the spectrum noise
should be considered structural rather than additive.

Figure 5: ROC curves for four jitter estimators, using
samples from the MEEI database. The proposed method,
using fixed size frame, is the most discriminant.


Abstract: An attempt to analyse formant
movements within the cycle of vocal fold vi- 
bration by linear prediction is presented. To 
achieve a high temporal resolution the dura- 
tion of the analysis window is adjusted shorter 
than the fundamental period. The measure- 
ment noise increased thereby is counteracted 
by averaging the parameter estimates by clus- 
tering. Preprocessing by a linear band pass 
filter and by polynomial regression is dis- 
cussed. The function of the method is demon- 
strated by analysing a synthetic formant with 
known resonance parameter contours. Then 
the method is applied to the first and sec- 
ond formant of the vowels [i:], [a:], and [u:] 
of a male speaker. The instantaneous formant
and bandwidth contours are compared to the
electroglottographically recorded contours of
vocal fold tissue contact. An interpretation in
terms of time varying acoustic coupling of the 
subglottal cavity through the larynx is pro- 
posed. 

Keywords: subglottal coupling, linear predic- 
tion, electroglottography 


I. INTRODUCTION 


Parameterization and modeling of voice quality is the
field of research of this work. Two measurement
techniques are the starting point for the actual
development: electroglottography and the harmonic
spectrum of the sound.

The time domain contour of electroglottography
serves as a reference to the physiological kinematics
of phonation. It shows the degree of tissue contact in
the larynx directly. Unfortunately it gives only
limited information about the acoustic excitation of the
vocal tract.

Based on the observations of [2], spectral estimates
that correlate with voice quality parameters like open
quotient and glottal opening were developed. This
method uses analysis windows that include at least
two pitch periods in order to show the spectral peaks
of the harmonic signal structure. This long term
analysis provides a certain noise immunity but the
following three features are considered as disadvantages: (i)
The amplitude of the fundamental oscillation is used
as a reference point. But this low frequency component
is not essential for speech perception and is not
transferred to the listener over the telephone. (ii) The
amplitude transfer function of the vocal tract has to
be estimated and compensated for, which proved to
be very complicated in practice. (iii) Rough voice and
other voice quality phenomena that deviate
fundamentally from the periodic structure are theoretically
and practically not well covered by the measurements
on the harmonic spectrum.

Therefore a different analysis method is proposed 
that measures at the formants in spectral regions 
where the speech signal is most prominent and most 
relevant for speech perception. It also uses the acous- 
tic recording and tries to observe the modulation of 
the formant parameters center frequency and band- 
width due to the movements of the vocal folds. 

A further alternative technique to derive the acous- 
tic excitation of the vocal tract is inverse filtering. In 
particular the application of time variant inverse fil- 
ters seems promising but is clearly beyond the scope 
of this study. 


II. METHOD 


During voiced phonation the opening and closing vo- 
cal folds imprint time variation on the acoustic pa- 
rameters of the vocal tract. In particular, changes to 
the frequencies and bandwidths of the resonances, i.e. 
formants, were predicted [3, pp.299]. The time scale 
of these temporal changes to the formant frequencies 
and bandwidths is the duration of a single pitch cycle 
and shorter, i.e. typically 5-10 milliseconds. 

The standard tool for formant analysis, linear
prediction, usually is applied to segments of e.g. 25ms
that contain more than a single pitch cycle. With
such a long analysis window, changes that take place
within a pitch cycle are not resolved and only
contribute on average to the frequency and bandwidth
estimates of the formants. Using a shorter analysis
window together with pre- and postprocessing, regu- 
lar modulations of the frequency and the bandwidth 
of the first formant within each pitch cycle can be 
displayed. They seem to reflect physical events that 
are visible in the electroglottogram (EGG) like glot- 
tal closure and opening. Furthermore, the parameter 
modulations seem to display the appropriate changes 
where the EGG does not show changes in tissue con- 
tact, e.g. in phases of incomplete closure [1]. 


A. Signal conditioning 


A standard first-order difference filter is applied for 
preemphasis of the formants over the low frequency 
components of voiced excitation. The zero of the filter 
is located on the real axis at 0.99. 
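
A minimal sketch of such a preemphasis filter is given below, assuming the usual difference equation y[n] = x[n] − 0.99 x[n−1] with a zero initial state. The function name and test signals are illustrative.

```python
def preemphasis(samples, zero=0.99):
    """First-order difference filter y[n] = x[n] - zero * x[n-1]."""
    out = []
    previous = 0.0          # assumed zero initial state
    for x in samples:
        out.append(x - zero * previous)
        previous = x
    return out


# A constant (0 Hz) input is almost removed after the first sample ...
constant = preemphasis([1.0] * 5)
# ... while an input alternating at the Nyquist rate is nearly doubled.
alternating = preemphasis([1.0, -1.0, 1.0, -1.0, 1.0])
```

The two test signals show the intended tilt: low frequency components are attenuated strongly while high frequency components are emphasised.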

The speech components with higher frequencies
than the intended first two formants are suppressed
by a lowpass filter with a cut-off frequency of 2.5kHz.

In [1] polynomial regression is used to reduce the 
segment of the excitation waveform within the ac- 
tual analysis window before the correlation sequence 
is computed. Alternatively a best matching constant, 
straight line or parabola is subtracted from the in- 
put signal. The polynomial degree is selected after 
visual inspection of the resulting formant parameter 
contour. For female speech and for other than modal 
voice quality the formant parameter contours become 
more regular in some cases. 

A high pass filter with a cut off frequency above
the fundamental frequency of the voice and below the
center frequency of the first formant is used here. It has a
much clearer influence on the signal, seems to
produce more stable results and eliminates the selection
of the polynomial degree.

Combining the lowpass and the highpass filter char- 
acteristic results in a bandpass with a passband from 
275Hz to 2.5kHz. A 400 point FIR filter designed
with a Kaiser window and a minimum stop band
suppression of about 60dB is used.


B. Formant parameters 


After preprocessing an estimate of the autocorrela- 
tion sequence is transformed to the linear prediction 
polynomial by the Levinson-Durbin algorithm. A 
typical analysis window duration of 200 points cor- 
responds to 4ms at the used sampling rate of 48kHz. 
The order of the linear prediction polynomial is se- 
lected to be 49 which corresponds roughly to a pole 
per kilohertz of the total bandwidth of the digitized
signal and one pole on the real axis. 

The roots of this polynomial are extracted as the
eigenvalues of the polynomial's companion matrix.
The angle and radius of each root are mapped to its
frequency and bandwidth. These frequencies and
bandwidths are stored together with the center time of the
current analysis window. These raw parameter
estimates scatter broadly around the time varying
parameters of the resonator. To reduce the noise in
these parameter estimates by averaging or clustering,
a large number of analysis frames is used by moving
the analysis window only 10 points or 0.2ms.
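
The mapping from a root to formant parameters is not spelled out in the text. The sketch below assumes the conventional relations, frequency = angle · fs / (2π) and bandwidth = −ln(radius) · fs / π, and also checks the window and hop durations quoted above. The function names are illustrative.

```python
import cmath
import math


def pole_to_formant(pole, sample_rate):
    """Map a z-plane pole to (frequency in Hz, bandwidth in Hz)."""
    frequency = cmath.phase(pole) * sample_rate / (2.0 * math.pi)
    bandwidth = -math.log(abs(pole)) * sample_rate / math.pi
    return frequency, bandwidth


def formant_to_pole(frequency, bandwidth, sample_rate):
    """Inverse mapping, used here only to build a test pole."""
    radius = math.exp(-math.pi * bandwidth / sample_rate)
    angle = 2.0 * math.pi * frequency / sample_rate
    return cmath.rect(radius, angle)


fs = 48000
pole = formant_to_pole(500.0, 100.0, fs)
freq, bw = pole_to_formant(pole, fs)   # recovers 500 Hz and 100 Hz

window_ms = 1000.0 * 200 / fs          # 200 points, about 4.2 ms
hop_ms = 1000.0 * 10 / fs              # 10 points, about 0.21 ms
```

Only roots with a positive angle need to be kept, since the complex conjugate partner of each root carries the same frequency and bandwidth.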


C. Noise reduction 


The raw parameter estimates are processed to the 
frequency and bandwidth estimates of the first and 
second formant by averaging or clustering. First a 
frequency interval is selected. Currently this is done 
after visual inspection of either the scatter plot of the 
raw frequency estimates or a wide band spectrogram. 

The following averaging method is used for the first
formant in [1]. Every 5 subsequent raw frequency
estimates within the frequency interval are averaged and
plotted as the formant estimate contour. The
accompanying raw bandwidth estimates are limited to a
maximum of 600Hz to limit outliers, averaged and plotted
as the bandwidth estimate contour. Unfortunately
the window duration of this smoothing technique
depends on the density of the raw estimates and varies
around 1ms. Either there are short time intervals
without any estimate in the considered frequency
interval and the smoothing extends over these 'holes',
or there is more than one pole at the same instant
and the 5 point averaging ends before 1ms.

To proceed to a more automated averaging tech- 
nique, k-means clustering is used here. Now a wider 
frequency interval is specified. For short isolated vow- 
els the broad interval of [300Hz,2200Hz] and 4 initial 
clusters resulted in good estimates of the first two for- 
mants. If front to back vowel movements are analysed 
it is better to specify the range for each formant sep- 
arately. In time direction an interval of 4ms duration 
is located at the beginning of the speech segment and 
moved in 2ms steps through the signal. At every
position of the time frequency window the k-means
clustering is started. To find the first and second formant
the clustering algorithm searches for clusters, starting at
the resulting center frequencies of the last frame. The
initial cluster centers are either the centers of the
frequency intervals selected manually, or the resonances
of a tube open at one end, at 500 + n × 1000Hz,
following a simple model of an [e] or [e] sound.
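
The clustering step for a single time window can be sketched as follows. This is a plain one-dimensional k-means (Lloyd's algorithm) with two clusters started at the tube resonances 500Hz and 1500Hz; it is a simplification and not the authors' implementation, which uses 4 initial clusters for the broad interval. The raw frequency values are hypothetical.

```python
def kmeans_1d(values, centers, iterations=20):
    """Lloyd's algorithm on scalar values, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous center.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


# Hypothetical raw frequency estimates (Hz) inside one 4 ms time window.
raw = [610.0, 640.0, 655.0, 630.0, 1180.0, 1210.0, 1195.0, 1225.0]
initial = [500.0 + n * 1000.0 for n in range(2)]   # 500 Hz and 1500 Hz
f1, f2 = kmeans_1d(raw, initial)
```

For the next window the returned centers would replace `initial`, which mirrors the tracking from frame to frame described above.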


D. Synthesized formant 


To learn to what extent this analysis method is able
to display short term variations of formant
parameters, a single formant with a prescribed frequency and
bandwidth contour is synthesized. The changing
parameters are implemented by a time varying recursive
filter of second order. The filter is excited by an
impulse every time the contour of the center frequency
reaches its maximum. Frequency and bandwidth are 
changed along a sinusoidal contour that starts with 
a period of 12ms and is accelerated to a period of 
8ms within 50ms. The center frequency of the ar- 
tificial formant is moved between 500Hz and 600Hz. 
The bandwidth is moved between 100Hz and 200Hz 
inversely phased with the center frequency. The pa- 
rameter contours are shown in the upper track of Fig. 
1 and Fig. 2. 
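
A rough sketch of such a synthesis is shown below. Several details are assumptions that the text does not fix: the contour period is taken to shrink linearly from 12ms to 8ms, the resonator is a standard two-pole section with pole radius exp(−πB/fs), and the impulse is placed on the first sample after the contour phase wraps. All names are illustrative.

```python
import math


def synthesize_formant(sample_rate=48000, duration=0.05):
    """Excite a time-varying two-pole resonator with impulses.

    Returns the output samples, the frequency and bandwidth contours in Hz,
    and the number of excitation impulses.
    """
    total = int(sample_rate * duration)
    phase = 0.0                 # phase of the sinusoidal parameter contour
    y1 = y2 = 0.0               # previous two output samples
    pulse = True                # the contour starts at its frequency maximum
    pulses = 0
    signal, freqs, bands = [], [], []
    for n in range(total):
        # Assumption: the contour period shrinks linearly from 12 ms to 8 ms.
        period = 0.012 - 0.004 * n / (total - 1)
        freq = 550.0 + 50.0 * math.cos(phase)    # 500..600 Hz
        band = 150.0 - 50.0 * math.cos(phase)    # 100..200 Hz, inverse phase
        radius = math.exp(-math.pi * band / sample_rate)
        a1 = -2.0 * radius * math.cos(2.0 * math.pi * freq / sample_rate)
        a2 = radius * radius
        x = 1.0 if pulse else 0.0
        pulses += int(pulse)
        pulse = False
        y0 = x - a1 * y1 - a2 * y2               # second-order recursion
        signal.append(y0)
        freqs.append(freq)
        bands.append(band)
        y2, y1 = y1, y0
        phase += 2.0 * math.pi / (period * sample_rate)
        if phase >= 2.0 * math.pi:               # contour maximum reached again
            phase -= 2.0 * math.pi
            pulse = True
    return signal, freqs, bands, pulses


signal, freqs, bands, pulses = synthesize_formant()
```

Because the true contours are returned together with the signal, the output of the analysis can be compared sample by sample with the prescribed parameters.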


E. Speech material 


The vowels [i:], [a:], and [u:] of a male speaker with
normal pitch and with modal phonation are recorded 
with a sampling rate of 48ksps and with 16 bits linear 
amplitude resolution. The acoustic signal is trans- 
duced with an AKG CK 62-ULS condenser micro- 
phone connected to an AKG C 460 B preamplifier. 
The EGG signal is measured with a laryngograph 
model Lx Proc type PCLX from Laryngograph LTD. 
The recording room is anechoic with a reverberation 
time of 27 milliseconds. 


III. RESULTS 


A. Synthesized formant 


The synthesized signal is band pass filtered with a 
passband between 275Hz and 825Hz. No preempha- 
sis is applied since no source spectrum needs to be 
corrected in this simplified, impulse shaped source 
signal. 

Fig. 1 shows the contour of the center frequency
of the synthesized time varying formant and the
result of the frequency estimation algorithm of Sec. II.
Accordingly, the filter bandwidth and its estimate are
shown in Fig. 2.

The cluster centers within the frequency range of 
275Hz and 825Hz are taken as frequency estimates. 
The contour of the frequency estimate follows the con- 
tour of the artificial resonance. There is a positive 
bias of about 30Hz. 

Figure 1: Center frequency of an artificial formant
and its estimate.

Figure 2: Inversely phased bandwidth contour and its
falsely phased estimate.

Fig. 2 shows an estimated bandwidth contour that
is erroneously inversely phased to the filter bandwidth. 
The minima approach the original 100Hz but the 
maxima go up to 400Hz doubling the maximum filter 
bandwidth. 


To further investigate the falsely phased bandwidth 
estimate the filter bandwidth contour is put in phase 
with the frequency contour. The bandwidth estimate 
of this modified filter output has slightly modified 
amplitudes but does not change its phase. Rather it 
seems to resemble the frequency contour. This
observation is a drawback in the interpretation of the
bandwidth contour that needs to be investigated
further.


B. Vowels


The first two formants are considered. The preem- 
phasis filter is used and the band pass filter has a 
passband between 275Hz and 2.5kHz. The duration 
of the analysis window is 4 milliseconds. 


Within the stable regions of each vowel 50ms are lo- 
cated in a scatter plot of the raw frequency estimates. 
The frequency intervals around each first and second 
formant are identified in the same diagram. The re- 
sult of the clustering algorithm Sec. II.C is shown in 
Figs. 3 - 5. 

Figure 3: Vowel [a:]


Due to the acoustic wave propagation all frequency
and bandwidth contours are delayed about 2ms with
respect to the electroglottographic contour. Every
formant parameter contour shown is modulated by
the vocal fold cycle to a certain extent. The closed
vocal folds increase the first formant of all vowels shown.
The second formant is increased after a short delay
in [a:] and [i:] and decreased in [u:]. The bandwidth
of the first formant of [a:] and [i:] is decreased by the
closed vocal folds and increased in [u:].


IV. CONCLUSION 


Frequency and bandwidth of the first and the fre- 
quency of the second formant are analysed by linear 
prediction with a short window and show rapid move- 
ments that could be caused by the opening and clos- 
ing of the vocal folds. Clustering of the raw param- 
eter estimates is demonstrated to be an appropriate 
smoothing technique. However, analysing the output
of a synthetic time varying filter showed no influence
of the filter bandwidth on the estimated bandwidth.

Abstract: Acoustic analysis of voice is potentially useful
for objective assessment and characterization of voice
disorders. However, before extracting acoustic measures
of voice it is first pertinent to ask: what do we mean by
voice? In describing voice, the perceptual impression
formed by the listener or the physical characteristics of 
the production mechanism may be of primary interest. 
With this in mind specific correlations with perception 
and source production are worthy of attention. The voiced 
speech signal recorded using a microphone comprises a 
glottal source signal, which has been resonated and 
radiated. Hence this signal is only indirectly related to the 
underlying source production mechanism. Furthermore it 
is only indirectly related to the perception of voice quality 
because auditory processing is not considered. Indices 
commonly extracted from the acoustic speech waveform 
include the harmonics-to-noise ratio (HNR), jitter and 
shimmer. This presentation inquires into how these 
measures relate to physical and perceptual
characterizations and into how progress on these issues 
may be advanced. 


Keywords : Acoustic analysis, harmonics-to-noise ratio


I. INTRODUCTION


Acoustic analysis of voice signals potentially provides an 
attractive mechanism for rating voice quality and for 
assessing the state of the larynx non-invasively, or even 
remotely. A number of commercial acoustic analysis 
systems are presently available for use in voice clinics. 
Although these systems may be helpful, at least for 
documentation purposes (e.g. objective monitoring of 
pre-/post- op, over the course of therapy etc.), a number 
of problems persist that seem to have prevented acoustic 
analysis techniques making a much greater impact on 
voice assessment and rehabilitation. 


Acoustic analysis of voice refers to signal processing of 
the microphone recorded voice signal. Early studies of 
spectrographic [1] and sonographic [2] displays revealed 
the presence of excessive noise and cycle length and 
cycle amplitude perturbations when comparing 
pathological voices to normal voices. In order to quantify
these waveform variations, indices termed harmonics-to-noise
ratio (HNR), jitter and shimmer were introduced
(cf. [3]).


II. THEORY 


The harmonics-to-noise ratio (HNR) is defined as the 
ratio of the periodic component to aperiodic component 
in voiced speech. 


$$\mathrm{HNR}(s) = 10\log_{10}\left(\frac{M\sum_{t=1}^{T} s_{avg}(t)^{2}}{\sum_{i=1}^{M}\sum_{t=1}^{T}\left[s_{i}(t)-s_{avg}(t)\right]^{2}}\right) \qquad (1)$$

HNR(s) indicates the harmonics-to-noise ratio of the
voiced speech waveform, s. M is the total number of
fundamental periods, s_i is the i-th fundamental period (of
length T) and s_avg is the waveform averaged over M
fundamental periods.


From the above definition it can be inferred that the ratio 
is sensitive to all forms of signal aperiodicity though it is 
often considered to reflect a measure of signal (or 
harmonic) energy to turbulent noise energy at the glottis. 
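
A small numerical sketch of this time-domain definition follows. It assumes that Eq.(1) compares M times the energy of the period-averaged waveform with the summed energy of the deviations of each period from that average, and that all periods have the same length. The pulse shape and the perturbation are hypothetical.

```python
import math


def hnr_db(periods):
    """HNR in dB from M equal-length periods (each a list of samples)."""
    count = len(periods)
    length = len(periods[0])
    average = [sum(p[t] for p in periods) / count for t in range(length)]
    harmonic = count * sum(a * a for a in average)
    noise = sum((p[t] - average[t]) ** 2
                for p in periods for t in range(length))
    return 10.0 * math.log10(harmonic / noise)


# Hypothetical data: a fixed pulse shape plus a small alternating offset.
shape = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0]
periods = [[v + (0.1 if i % 2 == 0 else -0.1) for v in shape]
           for i in range(4)]
value = hnr_db(periods)   # about 16.2 dB for this example
```

Any deviation from the average period lowers the value, whether it comes from additive noise, jitter or shimmer, which illustrates why the ratio is sensitive to all forms of aperiodicity.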


Jitter is a measure of the temporal variation in glottal 
cycle length from cycle to cycle. Shimmer reflects the 
variation of peak amplitude in a glottal cycle from cycle 
to cycle. It is interesting to note that these indices are 
defined for the speech waveform although it is generally 
glottal source characteristics that are inferred. Some 
consequences of source/filter theory in inferring source 
changes as measured from the speech waveform have 
been highlighted recently [4]. 


As these indices are extracted from the voiced speech 
signal it is not immediately clear how the measures relate 
to the physical state of the vocal folds or to the perception 
of voice quality. Let us consider HNR as a specific 
example. 


The harmonics-to-noise ratio (HNR(g)) of the glottal 
waveform is defined as 
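
$$\mathrm{HNR}(g) = 10\log_{10}\left(\frac{M\sum_{t=1}^{T} g_{avg}(t)^{2}}{\sum_{i=1}^{M}\sum_{t=1}^{T}\left[g_{i}(t)+n_{i}(t)-g_{avg}(t)\right]^{2}}\right) \qquad (2)$$

$$\mathrm{HNR}(g) = 10\log_{10}\left(\frac{M\sum_{t=1}^{T} g_{avg}(t)^{2}}{\sum_{i=1}^{M}\sum_{t=1}^{T} n_{i}(t)^{2}}\right) \qquad (3)$$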

In comparing Eq.(1) and Eq.(3) it can be inferred that

HNR(s) ≠ HNR(g) (4)


Eq.(1) represents the harmonics-to-noise ratio of the 
voiced speech signal, while Eq.(3) represents the 
harmonics-to-noise ratio at the glottis. Although HNR(g) 
is of more interest (for physical correlations at least) it is 
HNR(s) that is measured in reality because it is the 
microphone recorded voiced speech signal (s) that is 
generally available. 


Writing s, the voiced speech waveform, in terms of g, the
glottal waveform, n, glottal noise, v, the vocal tract and r,
the radiation load, for the i-th period allows for a more
detailed comparison of Eq.(1) and Eq.(3)


$$s_{i} = (g_{i} + n_{i}) * v_{i} * r_{i} \qquad (5)$$


For convenience v_i and r_i can be represented as vr_i to
incorporate the combined effect of vocal tract filtering
and radiation at the lips. As the segment of interest is
considered to result from a quasi-stationary process the
filtering and radiation effects can simply be represented
as vr (in reality small fluctuations in vr will lead to
increased aperiodicity). Hence period i can be represented
as


$$s_{i} = (g_{i} + n_{i}) * vr \qquad (6)$$


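Hence the average voiced speech waveform can be
written as


$$s_{avg} = \frac{1}{M}\sum_{i=1}^{M} g_{i} * vr + \frac{1}{M}\sum_{i=1}^{M} n_{i} * vr \qquad (7)$$


as n_i is random noise the second term disappears to give


$$s_{avg} = \frac{1}{M}\sum_{i=1}^{M} g_{i} * vr = g_{avg} * vr \qquad (8)$$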


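similarly the variance can be written as


$$\sum_{i=1}^{M}\sum_{t=1}^{T}\left[\left((g_{i}+n_{i})*vr\right)(t) - \left(g_{avg}*vr\right)(t)\right]^{2} = \sum_{i=1}^{M}\sum_{t=1}^{T}\left[\left(n_{i}*vr\right)(t)\right]^{2} \qquad (9)$$


Hence HNR(s) can be written as


$$\mathrm{HNR}(s) = 10\log_{10}\left(\frac{M\sum_{t=1}^{T}\left[\left(g_{avg}*vr\right)(t)\right]^{2}}{\sum_{i=1}^{M}\sum_{t=1}^{T}\left[\left(n_{i}*vr\right)(t)\right]^{2}}\right) \qquad (10)$$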


Comparing Eq.10 with Eq.3 it can be seen that vr is
retained within the calculation of HNR(s). Hence HNR(s)
does not tell us directly about HNR(g). Viewing this
problem from the frequency domain facilitates an
alternative calculation that allows for the removal of vr
from the resulting harmonics-to-noise measure.


Given that the periodic voiced speech waveform, s, can be
represented as

$$s = (g + n) * vr \qquad (11)$$

where g is a periodic glottal pulse and n is glottal noise,
the corresponding frequency domain representation is:

$$S_{k} = (G_{k} + N_{k}) \times V_{k}R_{k} \qquad (12)$$


where S_k, G_k, N_k, V_k and R_k are the Fourier transforms of
their corresponding time-domain functions and k is the
frequency index. The corresponding HNR for voiced
speech can be shown to be


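$$\mathrm{HNR}(S) = 10\log_{10}\left(\frac{\sum_{k=1}^{M/2}\left|G_{k}V_{k}R_{k}\right|^{2}}{\sum_{k=1}^{M/2}\left|N_{k}V_{k}R_{k}\right|^{2}}\right) \qquad (13)$$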


Taking an alternative summation allows VR to be 
removed from the calculation. 


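$$\mathrm{HNR}(G)' = 10\log_{10}\left(\frac{2}{M}\sum_{k=1}^{M/2}\frac{\left|G_{k}V_{k}R_{k}\right|^{2}}{\left|N_{k}V_{k}R_{k}\right|^{2}}\right) = 10\log_{10}\left(\frac{2}{M}\sum_{k=1}^{M/2}\frac{\left|G_{k}\right|^{2}}{\left|N_{k}\right|^{2}}\right) \qquad (14)$$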


to provide a glottal source related HNR, HNR(G)'. In
general this ratio is not equal to HNR(g) (Eq.3) because,
rather than summing the signal energy and dividing by
the summed noise energy, the harmonics-to-noise ratio at
each frequency point k is estimated and an average of
these ratios is determined. G_k is non-zero at harmonic
locations and N_k is estimated at between-harmonic
locations. HNR(G)', as calculated above, is, to a first
approximation, independent of the influence of the vocal
tract.
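
The difference between the two summations can be illustrated numerically. In the sketch below the squared magnitudes of the glottal harmonics, the noise and the combined vocal tract and radiation response are hypothetical, and the pairing of each harmonic value with a neighbouring noise value is an assumption. The ratio of summed energies changes when the transfer function is reshaped, whereas the average of the per-frequency ratios does not.

```python
def ratio_of_sums(harmonic, noise, transfer):
    """Summed harmonic energy over summed noise energy, both shaped by VR."""
    num = sum(h * t for h, t in zip(harmonic, transfer))
    den = sum(n * t for n, t in zip(noise, transfer))
    return num / den


def mean_of_ratios(harmonic, noise, transfer):
    """Average of the per-frequency ratios; VR cancels in every term."""
    ratios = [(h * t) / (n * t) for h, n, t in zip(harmonic, noise, transfer)]
    return sum(ratios) / len(ratios)


# Hypothetical squared magnitudes at four frequency points.
glottal = [100.0, 25.0, 4.0, 1.0]       # |G_k|^2
noise = [1.0, 1.0, 1.0, 1.0]            # |N_k|^2
flat = [1.0, 1.0, 1.0, 1.0]             # |V_k R_k|^2 of a flat response
shaped = [0.5, 4.0, 8.0, 0.25]          # |V_k R_k|^2 with a resonance

speech_flat = ratio_of_sums(glottal, noise, flat)        # 32.5
speech_shaped = ratio_of_sums(glottal, noise, shaped)    # about 14.3
source_flat = mean_of_ratios(glottal, noise, flat)       # 32.5
source_shaped = mean_of_ratios(glottal, noise, shaped)   # 32.5
```

In this idealised setting the cancellation is exact; with real spectra the harmonic and noise estimates come from neighbouring frequency locations, so the independence from the vocal tract holds only approximately.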

Eq.14 employs the spectrum of voiced speech to extract
an index related to the signal-to-noise of the glottal 
source. Alternative strategies are required in an attempt to 
extract measures related to the perception of voice 
quality. In a speech coding context frequency weighting 
has been employed to match aspects of the auditory 
processing mechanism [5]. A basic auditory perceptual 
harmonics-to-noise ratio (HNR(A)) is given as 


MA (15) 


where |S_b|^2 and |N_b|^2 represent the signal and noise
energies, respectively, in frequency band b, and w_b
represents the frequency weighting for band b. The
frequency bands are spaced in accordance with the 
critical bands of the ear [5]. Although such measures 
have been employed in quality assessment of speech 
coding and transmission systems, to date, they do not 
appear to have been employed specifically for voice 
quality assessment. 


III. METHOD


A. Synthesis


The vowel /a/ is synthesized using an implementation of a
discrete time model for speech production with the
Rosenberg glottal flow pulse [6] used as the source
function. A sequence of these pulses is used as input
into a delay line digital filter [7], where the filter
coefficients are obtained based on area function data for
the Russian vowel /a/ as given by Fant [8]. Radiation at
the lips is modeled by the first order difference equation
R(z) = 1 - z^{-1}. Random noise is introduced to the glottal
pulse via a random noise generator arranged to give noise
of a user specified variance (4%, 8% and 16% s.d.).
Signals are created for these three levels of additive noise
for frequencies beginning at 80 Hz and increasing in six,
approximately equi-spaced steps of 60 Hz up to 350 Hz.
A sampling frequency of 10 kHz is used throughout.


B. Analysis 


The present method employs a spectral based HNR
estimation technique similar to the one described in [9].
A Hamming window of length 2048, padded to 4096 points
and hopped by 1024 points, is used in the spectral
estimation process, providing 8 individual spectral
estimates for about 1.2 seconds of speech. The harmonic energy
estimates are obtained by summing the points within the
mainlobe width (8×4096/2/2048), while noise estimates
are calculated by summing the energy between harmonic
mainlobes. The measures HNR(S), the harmonics-to-noise
ratio of the voiced speech signal, and the glottal related HNR
(HNR(G)') are extracted from the speech spectra. An
auditory perceptual HNR (HNR(A)) is not examined in
the present study.


III. RESULTS


HNR(S) is plotted against fundamental frequency in 
Fig.1. It is observed that as f0 increases HNR(S) 
increases in a nonlinear fashion, for equal noise levels of 
the glottal source. 

Fig.1 The harmonics-to-noise ratio of the speech signal,
HNR(s) versus fundamental frequency (f0) for three
levels of glottal noise, 4%, 8% and 16%.


In contrast to Fig.1, HNR(G) does not change as f0
changes, i.e. 4%, 8% and 16% glottal noise gives rise to
HNR(g)'s of 28 dB, 22 dB and 16 dB respectively, for all
values of f0, as expected. However in practice G is
typically not available for analysis. Fig.2 shows
HNR(G)_{1-5} plotted for the frequency range from 1 to 5
kHz (i.e. the 0-1 kHz region is excluded from the
calculation). The variation in HNR(G)_{1-5} is similar to the
variation of HNR(S). Fig.3 shows the variation of the
glottal source related HNR (Eq.14), HNR(G)', versus f0
for the same noise levels. The response to noise at a given
f0 is approximately linear while the f0 variation is greatly
reduced. However the measure still increases slightly as
f0 increases.

Fig.2 The bandlimited (1-5 kHz) harmonics-to-noise ratio
of the glottal signal, HNR(G)_{1-5} versus fundamental
frequency (f0), for three levels of glottal noise, 4%, 8%
and 16%.

Fig.3 The glottal source related harmonics-to-noise ratio,
HNR(G)' versus fundamental frequency (f0), for three
levels of glottal noise, 4%, 8% and 16%.
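
The glottal HNR values quoted in this section for 4%, 8% and 16% noise can be reproduced with a short calculation. The sketch assumes that the percentage is the ratio of the noise standard deviation to the signal level, so that the HNR in dB is 20 log10(100 / percentage). The function name is illustrative.

```python
import math


def noise_percent_to_hnr_db(percent):
    """HNR in dB when the noise s.d. is `percent` of the signal level."""
    return 20.0 * math.log10(100.0 / percent)


# Roughly 28 dB, 22 dB and 16 dB for the three noise levels.
levels = {p: noise_percent_to_hnr_db(p) for p in (4, 8, 16)}
```

Each doubling of the noise percentage lowers the value by about 6 dB, which matches the spacing of the three quoted levels.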


IV. DISCUSSION 


The harmonics-to-noise ratio of the voiced speech signal, 
HNR(S) is f0 dependent for equal levels of glottal noise. 
This is a consequence of source-filter theory as a given 
filter characteristic is excited at different frequencies as 
f0 differs. Consider two glottal signals that are scaled
versions of each other (100 Hz and 200 Hz), each with a
fall-off of approximately −12 dB/octave. The 100 Hz
pulse reaches a level of −60 dB at 1600 Hz while the 200
Hz pulse does not reach this level until 3200 Hz. Hence
the higher frequency signal has larger amplitude
harmonics (though less densely packed) further up the
frequency range in comparison to the 100 Hz signal.
Plotting the bandlimited harmonics-to-noise ratio of the
glottal signal, HNR(G)_{1-5}, helps to illustrate this point. A
trend very similar to HNR(S) versus f0 is observed. The
higher frequency glottal signals have higher energy in the
high frequency region.

From an analysis viewpoint, the variation of HNR(S) 
versus f0 is problematic if we wish to infer a measure of 
glottal signal-to-noise. If the glottal signal has a certain % 
noise it is desirable to measure the corresponding signal- 
to-noise ratio independent of f0. HNR(G)’ provides an 
estimate of the glottal signal-to-noise status which is 
approximately independent of the influence of the vocal 
tract. As shown in Fig.3, HNR(G)' significantly reduces
this f0 dependence. Some variation with f0 remains; this
can be reduced (for the same reason outlined above
regarding scaled glottal signals) by limiting the
calculation to a set number of harmonic/between-harmonic
locations as opposed to employing a set
frequency range.


V. CONCLUSION AND FUTURE WORK


The harmonics-to-noise ratio of the speech signal 
(HNR(S)) is f0 dependent. This is problematic if HNR(S) 
is to be used to infer information regarding the glottal 
flow signal-to-noise ratio or to distinguish between 
patient and normal data sets. An alternative, glottal 
source related HNR, HNR(G)’ was introduced to provide 
a HNR measure that is largely f0 independent. It is 
postulated that limiting this ratio to a set number of 
harmonics will remove most of the remaining f0 variation 
of the measure. 

New methods of HNR estimation are also required to
provide measures that are more relevant from a perceptual
viewpoint. Eq.15 defines an auditory perceptual HNR,
HNR(A). It would be interesting to correlate the
perception of jitter, shimmer and noise with HNR(A),
taking into consideration the spectral characterization of
these aperiodicities [10].

Abstract : The inter-rater variability in
perceptual voice evaluation still limits the 
widespread clinical use of the best available 
rating system. Support of visible speech in 
experimental conditions demonstrates a 
significant enhancement of the inter-rater 
agreement. However it does not influence the 
correlation between perceptual and 
conventional acoustic parameters. The 
addition of visible speech to the clinical setting 
is feasible since nowadays affordable 
computer programs provide the spectrogram 
in quasi real time. 

Keywords : Dysphonia, Perceptual evaluation, 
Visible Speech, Acoustic analysis. 


I. INTRODUCTION 


The GIRBAS scale introduced by Hirano [1] has 
become a commonly used scale for perceptually 
rating the severity of deviance in voice quality. 
However, the judgments of different raters (even 
experienced ones) may differ considerably [2;3]. 
Acoustical analysis of pathological voice has 
several advantages: it is quantitative, non-invasive, 
and cost and time efficient. As a 
disadvantage, most acoustical analysis relies on 
quasi-periodic waveforms and thus cannot be 
used on noisy and irregular voices. Further, 
because of the lack of one-to-one relations 
between acoustical and perceptual voice 
parameters, the perceptual assessment cannot be 
replaced by acoustical analysis. 

Sound spectrograms are another approach: the 
spectrogram enables visualization of speech 
(and is therefore sometimes referred to as visible 
speech) and has been widely used in voice and 
speech research. It has also been applied clinically to 
evaluate voice [4]. The present study examined 
whether adding visible speech would enhance the 
interrater agreement of perceptual ratings of 
pathological voices. Since spectrograms reveal 
acoustical properties which are related to 
parameters such as jitter and noise-to-harmonic ratio, 
it is conceivable that visible speech increases the 
correlations between acoustical and perceptual 
parameters. Therefore, the effect of visible 
speech on these correlations was also examined in 
this study. 


II. MATERIALS AND METHODS 


Pathological voices 

Seventy pathological voices of all kinds of 
etiologies were digitally recorded. The recorded 
voice tracks consisted of a prolonged /a/ with a 
duration of several seconds and a spoken 
sentence in Dutch. 


Visible speech 

The visible speech consisted of two spectrograms 
(0 to 4000 Hz) of the sustained /a/: one 
spectrogram is produced with a fine frequency 
resolution (bandwidth: 59 Hz) showing 
harmonics and the other with a fine time 
resolution (bandwidth: 300 Hz) showing glottal 
pulses. 


Perceptual evaluation 

Six experienced raters independently evaluated 
the voice samples (prolonged /a/ and sentence, 
all on a CD) in two sessions with an interval of 
4-10 months between the sessions. During the 
second evaluation session the accessory visible 
speech of the sustained /a/ was presented to the 


Models and analysis of vocal emissions for biomedical applications: 5th international workshop: 


December 13-15, 2007: Firenze, Italy, ed. by C. Manfredi, 
ISBN 978 88-8453-673-3 (print) ISBN 978-88-8453-674-7 (online) 
© Firenze university press, 2007. 




experts simultaneously with the acoustic 
presentation of the voice samples (prolonged /a/ 
and sentence). 


Acoustic evaluation 

A variety of acoustic parameters was 
calculated using the Multi-Dimensional Voice 
Program (MDVP, Kay Elemetrics Corp.) on a 
relatively stationary part of the prolonged /a/. 


Agreement 

Agreement between the perceptual evaluations of 
two experts can be estimated using the parameter 
kappa (κ) introduced by Cohen. Cohen's kappa 
corrects for agreement by chance. To assess the 
agreement among the six raters we computed 
kappa according to Fleiss [5], who extended 
Cohen's kappa to more than two raters. 
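
As an illustration of the agreement measure, the following minimal Python sketch computes Fleiss' kappa from a table of ratings. The function name `fleiss_kappa`, the data layout (one list of scores per voice, one score per rater) and the 0 to 3 scoring range are assumptions made for the example; they are not taken from the software used in the study.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per subject.

    ratings:    list of per-subject lists, one label per rater (hypothetical layout)
    categories: iterable of all possible labels, e.g. range(4) for scores 0..3
    """
    categories = list(categories)
    n_subjects = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][c] = number of raters who assigned category c to subject i
    counts = [Counter(subject) for subject in ratings]

    # Observed agreement for each subject, then averaged over subjects
    p_i = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Agreement expected by chance, from the overall category proportions
    p_j = [
        sum(c[cat] for c in counts) / (n_subjects * n_raters)
        for cat in categories
    ]
    p_e = sum(p ** 2 for p in p_j)

    return (p_bar - p_e) / (1.0 - p_e)
```

With six raters in full agreement on every voice the function returns 1; values near 0 indicate agreement no better than chance. The sketch does not guard against the degenerate case in which every rating falls in a single category (chance agreement of 1).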

To determine whether the agreements found in 
the conventional and visible-speech conditions 
significantly differ, the two kappa values were 
statistically tested. 

Acoustical versus perceptual parameters 

Since the perceptual parameters (G, I, R, B, A 
and S) are ordinal, the correlation between the 
acoustic and perceptual evaluations is calculated 
using the Spearman rank correlation coefficient. 
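
The rank correlation can be sketched in a few lines of Python. This is only an illustrative implementation (Pearson correlation of average ranks, which handles the ties that ordinal scores inevitably produce); the function names are hypothetical and the study's own statistical software is not specified in the text.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Here `x` could hold one perceptual score per voice (for example G) and `y` the corresponding acoustic value (for example jitter).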


III. RESULTS 


Interrater agreement 

The ratings of the 70 voices were used to 
calculate κ for six raters. The agreement between 
ratings was significantly higher with than 
without visible speech for the perceptual 
parameters G, R and B (Fig. 1). 


Acoustical and perceptual parameters 

We correlated the perceptual parameters G, I, R, 
B, A and S with various acoustical parameters 
for ratings with and without visible speech. No 
significant changes in correlation were found. 
Fig. 2 shows the effect for jitter and shimmer. 
We investigated the effect of different selection 
windows on the correlations between acoustical 
and perceptual parameters. We compared the 
entire vowel including onset ramp and offset 
damp, the standard window, and a fixed-duration 
(1 s) window 250 ms to 1250 ms after onset. 
These different selection windows did not 
produce different results on the correlations 
between acoustical and perceptual parameters. 


IV. DISCUSSION 


Our study produced two pronounced results. 
First, the interrater agreement was clearly larger 
with than without visible speech (information as 
provided in spectrograms of the voice track) for 
rating grade, breathiness and roughness. Second, 
visible speech had no effect on the correlations 
of the GIRBAS ratings with acoustical 
parameters. 

The addition of visible speech to the clinical 
setting is feasible since affordable computer 
programs can provide it in quasi-real-time. 
Hence, the enhancement of the interrater 
agreement is an important finding. 

No systematic shifts have been found with the 
addition of visible speech: on the average, G 
increased whereas B decreased, and R did not 
shift. Considering the wide distribution of 
ratings, the ratings with visible speech seem to 
distinguish well between various voices. 

Our results confirm the notion that perceptual 
rating cannot be replaced by acoustical 
parameters at least as produced by MDVP 
paradigms. Perceptual and acoustic measures can 
be considered complementary. Hence, an 
optimal evaluation of voice quality is achieved 
according to a multidimensional protocol, 
including acoustic and perceptual measures [6;7]. 


V. CONCLUSION 


Support of visible speech demonstrates a 
significant enhancement of the inter-rater 
agreement in perceptual voice evaluation. It does 
not influence the correlation between perceptual 
and conventional acoustic parameters. 
Figure 1. Kappa for 6 raters for G, I, R, B, A and S without and with visible speech. White bars reflect 
kappa values without visible speech, black bars reflect kappa values with visible speech. Significant 
differences between kappa with and without visible speech are * : p < 0.05, ** : p < 0.001. 
Figure 2. Correlations between perceptual parameters grade, roughness and breathiness with the acoustic 
parameters jitter and shimmer. White bars reflect correlation coefficients without visible speech, black bars 
reflect correlation coefficients with visible speech. Differences were not significant (p > 0.05). 
Abstract: Most vocal and voice diseases cause 
changes in the acoustic voice signal. Acoustic analysis 
is a useful tool to diagnose these kinds of diseases; 
furthermore, it presents several advantages: it is a 
non-invasive tool, it provides an objective diagnosis and 
it can be used for the evaluation of surgical and 
pharmacological treatments and rehabilitation 
processes. Most of the approaches found in the 
literature address the automatic detection of voice 
impairments from speech by using the sustained 
phonation of vowels. This paper proposes a new 
scheme for the detection of voice impairments from 
text-dependent running speech. The proposed 
methodology is based on the segmentation of speech 
into voiced and non-voiced frames, parameterising 
each frame with mel-frequency cepstral parameters. 
The classification is carried out using a discriminative 
approach based on a Multilayer Perceptron network. 
The data used to train the system were taken from the 
voice disorders database distributed by Kay 
Elemetrics. The material used for training and testing 
contains the running speech corresponding to the 
well-known “rainbow passage” of 226 patients (53 normal 
and 173 pathological). The results obtained are 
compared with those using sustained vowels. The 
text-dependent running speech showed a slight 
improvement in the accuracy of the detection. 


Keywords: running speech, pathological voices, mel 
cepstral parameters, multilayer perceptron 


I. INTRODUCTION 


The current state of acoustic analysis allows us to 
calculate a large number of long-term 
acoustic parameters. Such parameters (f0, jitter, shimmer, 
Harmonics to Noise Ratio (HNR), Normalized Noise 
Energy (NNE), Voice Turbulence Index (VTI), Glottal to 
Noise Excitation Ratio (GNE), Signal to Noise Ratio 
(SNR), Frequency Amplitude Tremor (FATR), etc. [1]) 
were developed to measure the quality and “degree of 
normality” of voice recordings from the sustained 
phonation of vowels. However, some of these parameters 
are based on an accurate estimation of the fundamental 
frequency, a rather complicated task in the presence of 
certain pathologies. On the other hand, there are other 
works in the literature using short-time features for the 
detection of voice impairments from the sustained 
phonation of vowels. Some of them address the automatic 
detection of voice impairments from the excitation 
waveform collected with a laryngograph [2] or extracted 
from the acoustic data by inverse filtering [3]. However, 
because inverse filtering is based on the 
assumption of a linear model, such methods do not 
behave well when pathology is present, owing to the 
non-linearities introduced by the pathology itself. Other authors 
have also proposed nonlinear signal processing for the 
same task [4]. In addition, some authors have 
obtained good results addressing the detection of voice 
impairments from running speech using different 
techniques [5;6]. 

In this paper we present an alternative approach 
for the detection of voice disorders using text-dependent 
running speech, comparing the results with those obtained 
using sustained vowels. It is well known that, regarding 
the evaluation of voice quality and the presence of 
pathologies, running speech contains much more 
information than the sustained phonation of vowels. This 
is why the widely used perceptual GRBAS scale (Grade 
of dysphonia, Roughness, Breathiness, Asthenicity, and 
Strain) [15] is usually evaluated by specialists 
using running speech. 

The preliminary results obtained in this work showed a 
slight improvement in the accuracy of the detection using 
text-dependent running speech rather than the sustained 
phonation of vowels. 

The paper is organized as follows: Section II gives an 
overview of the methodology used in this study. Section 
III contains the results obtained. Finally, Section IV 
presents a short discussion and the conclusions. 


II. METHODS 


The acoustic samples used for this work are recordings 
from patients with normal voices and a wide variety of 
organic, neurological, and traumatic voice disorders. 
These pathologies reveal themselves either as a 
modification of the morphology of the excitation organ (i.e. the 
vocal folds) or as a variation of the normal vibration 
pattern of the vocal folds. They may increase the 
mass or rigidity of certain organs, resulting in a 
different pattern of vibration that alters the 
periodicity (bimodal vibration), reduces higher modes of 
vibration (mucosal wave), and introduces more turbulent 
components in the voice record. Within this group the 
following pathologies can be enumerated, among others: 
polyps, nodules, paralysis, cysts, sulcus, edemas, 
carcinomas, etc. 

The speech samples used in this work were collected 
by the Massachusetts Eye and Ear Infirmary (MEEI) 
Voice and Speech Labs [11] in a controlled environment 
using a condenser microphone placed at 30 cm from the 
mouth. The recordings stored in this database were 
made with different sampling frequencies (from 50 to 
10 kHz) and 16 bits of resolution. The database contains 
the sustained phonation of the vowel /ah/ and recordings 
of the “rainbow passage”. This text-dependent passage 
has been widely used in speech therapy to evaluate the 
quality of speech from the perceptual point of view. 
Prior to the study, the files stored in the database were 
low-pass filtered and their sampling rate was adjusted to 
25 kHz for the sustained vowels and 10 kHz for 
the text-dependent recordings. A subset of 171 
pathological and 53 normal speakers has been taken 
according to those enumerated by Parsa et al. in [5]. 

Voice recordings were framed using 40 ms long Hanning 
windows with 50% overlap. Every frame is 
pre-processed in order to discard unvoiced segments and 
parameterized to reduce the dimensionality and 
complexity of the detector (Fig. 1). 
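
The framing step can be illustrated with a short Python sketch. It assumes the signal is a plain list of samples and that the window is the usual raised-cosine Hanning definition; the function name `frame_signal` and its parameters are hypothetical, not the authors' code.

```python
import math


def frame_signal(signal, fs, frame_ms=40.0, overlap=0.5):
    """Split a signal into Hanning-windowed frames.

    signal:   list of samples
    fs:       sampling rate in Hz
    frame_ms: frame length in milliseconds (40 ms in the text)
    overlap:  fraction of overlap between consecutive frames (50% in the text)
    """
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(frame_len * (1.0 - overlap)))
    # Symmetric Hanning window: zero at both ends, one at the centre
    window = [
        0.5 - 0.5 * math.cos(2.0 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

At a sampling rate of 10 kHz, a 40 ms frame holds 400 samples and consecutive frames start 200 samples apart. Trailing samples that do not fill a whole frame are dropped in this sketch.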

The parameterization is carried out using a 
non-parametric approach capable of modeling the effects of 
pathologies on both the excitation (vocal folds) and the 
system (vocal tract). The approach used is that 
known as FFT-based Mel-frequency Cepstral Coefficients 
(MFCC) [8]. This method is based on the human 
perception system, establishing a logarithmic relationship 
between the real frequency (Hz) and the perceptual frequency 
(mels). It performs the cosine transform over the 
logarithm of the energy, calculated in frequency bands 
whose bandwidth depends on the central frequency of 
each filter. An improved representation can be obtained 
by extending the analysis to include information about the 
speed and time evolution of the calculated parameters. 
First (Δ) and second (ΔΔ) derivatives [9] were 
added to the feature vector, making the analysis less 
localized in time. The calculation of Δ and ΔΔ was carried out by 
means of anti-symmetric moving-average Finite Impulse 
Response (FIR) filters to avoid phase distortion of the 
temporal sequence (length 9 for Δ, and 3 for ΔΔ). 
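
A minimal Python sketch of the two ideas in this paragraph follows. It makes several assumptions that the text does not state: the mel mapping uses the common 2595·log10(1 + f/700) formula, the anti-symmetric FIR filter is the usual regression-type delta filter (half-width 4 gives 9 taps, half-width 1 gives 3 taps), and edge frames are handled by repeating the first and last frame. The function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """A common Hz-to-mel mapping (assumed here; the paper gives no formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def delta(features, half_width):
    """Anti-symmetric FIR derivative along time.

    features:   list of frames, each a list of coefficients
    half_width: taps on each side; filter length is 2 * half_width + 1
    """
    n_frames = len(features)
    denom = 2.0 * sum(n * n for n in range(1, half_width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(features[t])):
            acc = 0.0
            for n in range(1, half_width + 1):
                # Repeat the edge frames when the filter runs past the ends
                ahead = features[min(t + n, n_frames - 1)][k]
                behind = features[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out
```

Under these assumptions the Δ features would be `delta(mfcc, 4)` and the ΔΔ features `delta(delta(mfcc, 4), 1)`, with the three sets concatenated frame by frame.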

The segmentation of voiced and unvoiced frames was 
carried out with a voiced-unvoiced detector based on the 
techniques reported in [10]. 
Fig. 1 shows the scheme used for the feature extraction 
and classification. The modeling is addressed by means 
of a Multilayer Perceptron (MLP) neural network using 
the non-parametric short-term MFCCs. Every vector of 
parameters is fed to a three-layer MLP [7] with 
100 hidden neurons and two output nodes with 
a logistic activation function. The input layer has as 
many inputs as MFCC parameters. Learning is carried out 
by the backpropagation algorithm with momentum. It is well 
known that, in a two-class problem, the output of each output node of such a 
structure may be interpreted as the 
likelihood that the input pattern belongs to the corresponding class. 
Each speaker is thus characterized by as many 
vectors as voiced frames extracted from the recording. For 
each frame, both the likelihood of being normal and the 
likelihood of being pathological are calculated from 
the score assigned to each output node. An index (called the 
likelihood ratio or log-likelihood ratio) is obtained by 
subtracting the log-likelihood (likelihood in the 
logarithmic domain) of being normal from the 
log-likelihood of being pathological. The decision about 
normality or abnormality is taken by establishing a threshold 
over the normalized likelihood ratio. 
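
The scoring rule can be sketched as follows. The text does not say how the likelihood ratio is normalized, so this example simply averages the per-frame log-likelihood ratios over the voiced frames of a speaker; that choice, the small floor `eps` that avoids taking the logarithm of zero, and the function names are all assumptions.

```python
import math


def speaker_score(frame_outputs, eps=1e-12):
    """Mean log-likelihood ratio over the voiced frames of one speaker.

    frame_outputs: list of (p_normal, p_pathological) pairs, one per frame,
                   as produced by the two output nodes of the network
    """
    llrs = [
        math.log(max(p_path, eps)) - math.log(max(p_norm, eps))
        for p_norm, p_path in frame_outputs
    ]
    return sum(llrs) / len(llrs)


def decide(score, threshold=0.0):
    """Label a speaker by comparing the score with a decision threshold."""
    return "pathological" if score > threshold else "normal"
```

A positive score means the frames are, on average, more likely pathological than normal; the threshold itself is chosen on the score distributions.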

The scores given by the detectors for normal and 
pathological voices were used to plot the true and false 
score curves. Decisions about the presence or absence of 
pathology are taken by establishing a decision boundary that 
ensures the minimum classification error. Fig. 2 illustrates 
the problem of finding an optimum decision threshold. 
The point where the error rates of 
both classes are equal is called the Equal Error Rate (EER), 
and it is usually considered an optimum point for the 
decision. However, the EER point might not be the best 
threshold due to the scatter of the density functions; in 
such a case, a new decision threshold is needed. Under 
these conditions, the threshold that corresponds to the 
minimum average error rate is called the Minimum Cost 
Point (MCP). According to Bayes decision theory, 
this point can be calculated by taking into account the 
difference in the risk of the two possible errors (false 
acceptance or false positive, and false rejection or false 
negative). 
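
Both thresholds can be found by a simple sweep over the observed scores, as in the sketch below. It assumes that higher scores indicate pathology, that candidate thresholds are restricted to observed score values, and that the two classes are weighted equally in the cost unless `cost_fp` and `cost_fn` say otherwise; the function names are hypothetical.

```python
def error_rates(normal_scores, pathological_scores, threshold):
    """Return (false positive rate, false negative rate) at a threshold.

    A file is called pathological when its score exceeds the threshold.
    """
    fp = sum(s > threshold for s in normal_scores) / len(normal_scores)
    fn = sum(s <= threshold for s in pathological_scores) / len(pathological_scores)
    return fp, fn


def eer_threshold(normal_scores, pathological_scores):
    """Observed score at which the two error rates are closest to equal."""
    candidates = sorted(set(normal_scores) | set(pathological_scores))

    def gap(t):
        fp, fn = error_rates(normal_scores, pathological_scores, t)
        return abs(fp - fn)

    return min(candidates, key=gap)


def min_cost_threshold(normal_scores, pathological_scores,
                       cost_fp=1.0, cost_fn=1.0):
    """Observed score minimizing the weighted sum of the two error rates."""
    candidates = sorted(set(normal_scores) | set(pathological_scores))

    def cost(t):
        fp, fn = error_rates(normal_scores, pathological_scores, t)
        return cost_fp * fp + cost_fn * fn

    best = min(candidates, key=cost)
    return best, cost(best)
```

Raising `cost_fn` relative to `cost_fp` moves the minimum-cost threshold towards fewer missed pathologies at the price of more false alarms.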


Fig. 1: Scheme used for the feature extraction and 
classification (block labels recoverable from the figure: Hanning 
windows, detector / removal, parameterization) 


For the evaluation of the accuracy of the system, we 
have adopted a cross-validation scheme, namely the 
bootstrap method to assess the generalization of the 
model. Each experiment is repeated N times, with a 
different test set, randomly chosen from the whole set of 
files. The final results are averaged across these 
repetitions, and confidence intervals are computed using 
the standard deviation of the measures. We repeated the 
experiment 11 times, combining the files detailed in the 
training and test sets randomly. The accuracy of the 
system was calculated by cross validation of the results. 
For each run, the available data files were divided into 
two subsets: 70% to train the system and 30% to validate the 
results. The number of voice samples from the database 
was 224 (53 normal and 171 pathological voices), 
according to the criteria found in [12]. 
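
The evaluation protocol (11 random 70%/30% splits, results averaged) can be sketched as below. The callback `evaluate`, which is expected to train on one list of files and return an accuracy on the other, is a hypothetical placeholder for the whole training and testing chain, and the fixed random seed is only there to make the sketch reproducible.

```python
import random
import statistics


def repeated_holdout(files, evaluate, n_runs=11, train_fraction=0.7, seed=0):
    """Average a score over repeated random train/test splits.

    files:    list of file identifiers
    evaluate: function (train_files, test_files) -> score, supplied by the caller
    Returns (mean score, sample standard deviation of the score).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        shuffled = files[:]
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(evaluate(train, test))
    return statistics.mean(scores), statistics.stdev(scores)
```

This sketch shuffles all files together; whether the original splits preserved the proportion of normal and pathological speakers is not stated in the text.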

The final results are presented through confusion 
matrices, for which we define the following measures: the true 
positive rate (tp), also called sensitivity, is the ratio 
between the pathological files correctly classified and the 
total number of pathological files; the false negative rate 
(fn) is the ratio between the pathological files wrongly 
classified and the total number of pathological files; the true 
negative rate (tn), sometimes called specificity, is the 
ratio between the normal files correctly classified and the 
total number of normal files; the false positive rate (fp) 
is the ratio between the normal files wrongly classified and 
the total number of normal files. The final accuracy of the 
system is the ratio between all the hits obtained by the 
system and the total number of files. 
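
These definitions translate directly into code. The sketch below assumes one label per file, written as the strings "pathological" and "normal"; the function name and the dictionary keys are illustrative.

```python
def confusion_rates(actual, predicted):
    """Compute tp, fn, tn, fp rates and accuracy from per-file labels.

    actual, predicted: equal-length lists of "pathological" / "normal"
    """
    pairs = list(zip(actual, predicted))
    n_path = sum(a == "pathological" for a in actual)
    n_norm = len(actual) - n_path

    # Rates are normalized by the size of the actual class, as in the text
    tp = sum(a == "pathological" and p == "pathological" for a, p in pairs) / n_path
    tn = sum(a == "normal" and p == "normal" for a, p in pairs) / n_norm
    hits = sum(a == p for a, p in pairs)

    return {
        "tp": tp,
        "fn": 1.0 - tp,
        "tn": tn,
        "fp": 1.0 - tn,
        "accuracy": hits / len(pairs),
    }
```

Because tp and fn share the same denominator they sum to one, as do tn and fp, which is why each column of a confusion matrix expressed in percent adds up to 100.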


Fig. 2. Probability distribution functions for both 
classes. The dashed lines correspond to the Minimum 
Cost Point and the Equal Error Rate. 


Throughout this work, the measurements enumerated 
were calculated using the EER threshold. The scores are 
compared to the EER threshold value in order to compute 
the confusion matrix. Moving this threshold yields 
a set of possible operating points for the system, which 
can be represented through a Detection Error Tradeoff 
(DET) plot [13], widely used in speaker verification. In 
this plot, the false positives are plotted against the false 
negatives for different threshold values. The DET 
curve plots error rates on both axes, giving uniform 
treatment to both types of error, and uses a scale for both 
axes that spreads out the plot, better distinguishes 
different well-performing systems and usually produces 
plots that are close to linear. Another choice is to 
represent the true positives against the false positives 
in a Receiver Operating Characteristic (ROC) curve [14]. The ROC 
displays the diagnostic accuracy expressed in terms of 
sensitivity against (1 - specificity) at all possible threshold 
values in a convenient way. The ROC is usually 
characterized and complemented by its Area Under the 
Curve (AUC) [14]. 
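
For illustration, the ROC operating points and the AUC can be computed from the two sets of scores as follows. The AUC is obtained here through its rank interpretation (the probability that a pathological score exceeds a normal one, ties counted as one half), which is one standard way to compute it and not necessarily the one used by the authors; the function names are hypothetical.

```python
def roc_points(normal_scores, pathological_scores):
    """List of (1 - specificity, sensitivity) pairs for every observed threshold."""
    thresholds = sorted(set(normal_scores) | set(pathological_scores))
    points = [(1.0, 1.0)]  # threshold below every score: everything is flagged
    for t in thresholds:
        fpr = sum(s > t for s in normal_scores) / len(normal_scores)
        tpr = sum(s > t for s in pathological_scores) / len(pathological_scores)
        points.append((fpr, tpr))
    return points


def roc_auc(normal_scores, pathological_scores):
    """Area under the ROC curve via pairwise comparison of the scores."""
    wins = 0.0
    for p in pathological_scores:
        for n in normal_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pathological_scores) * len(normal_scores))
```

An AUC of 1 means the two score distributions are perfectly separable, while 0.5 corresponds to chance.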


III. RESULTS 


We repeated the experiment 11 times, combining the 
files in the training and test sets randomly. Table 1 shows 
the mean and standard deviation values of the confusion 
matrix. The total accuracy of the system is 95.9% ± 2.8. 

Fig. 3 shows the DET and ROC plots that summarize 
the results obtained. The DET plot in Fig. 3b shows the 
overall performance of the detector, and the ROC 
plot in Fig. 3a, along with the AUC, gives a further 
indication of it. The DET and ROC curves 
were drawn by averaging the scores obtained with the 11 test 
sets. 


Fig. 3: Performance curves of the detector. a) ROC 
plot using text-dependent running speech (AUC = 0.997); b) DET plot 
for the text-dependent running speech and for the 
sustained vowels corpus 


It can be noticed that, using the same parameterization 
and classification approaches, the performance with 
text-dependent running speech slightly improved on the results 
obtained with sustained vowels. 


Table 1: Results of the classification (in %) (mean ± 
standard deviation) using text-dependent running speech. 


                                   Actual diagnosis 
                                   Pathological      Normal 
Detector's decision  Pathological  tp = 97.1 ± 3.4   fp = 10 ± 8.9 
                     Normal        fn = 2.9 ± 3.4    tn = 90 ± 8.9 


IV. DISCUSSION AND CONCLUSIONS 


This work has presented a methodology to 
automatically detect voice pathologies based on an MLP 
detector and short-term MFCC features using 
text-dependent running speech, comparing the results with 
those obtained with the sustained phonation of the /ah/ 
vowel. It is well known that, regarding the quality of 
voice and the presence of pathologies, running speech 
contains much more information than the sustained 
phonation of vowels. So, as expected, the results showed 
an improvement in the accuracy of the detection using 
text-dependent running speech. These results match very 
well with the fact that the perceptual evaluation 
made by otolaryngologists or speech therapists is usually 
based on running speech rather than on sustained vowels. 

On the other hand, MFCC parameters had been 
previously used for laryngeal pathology detection with 
sustained vowels, where they demonstrated good 
performance, surpassing other short-time features such as 
linear-prediction-based measurements. However, they had 
never been used with running speech. This preliminary 
work showed that short-term MFCCs are also 
a good parameterization approach for the detection 
of voice impairments using text-dependent running 
speech. The proposed detection scheme may therefore be used 
for laryngeal pathology detection with an accuracy of around 
96%. 

The current study opens the way to extending this 
method to classification tasks between different 
disorders or perceptual vocal qualities (e.g. hoarseness, 
breathiness, etc.), or to the categorization of speech 
recordings into different degrees of impairment, such as those of the 
GRBAS scale. 
Abstract: Voice is the result of the coordination of the 
whole pneumophonoarticulatory apparatus. The 
analysis of the voice allows the identification of 
diseases of the vocal apparatus and is currently 
carried out by an expert doctor through methods 
based on auditory analysis. The paper presents a 
web-based system for the acquisition and automatic 
analysis of vocal signals. Vocal signals are submitted 
by the users through a simple web interface and are 
analyzed in real time using state-of-the-art signal 
processing techniques, providing first-level 
information on possible voice alterations. The system 
offers different analysis functions to doctors, who 
may analyze suspected cases in detail. The system is 
currently being tested in the otorhinolaryngologist 
setting to carry out mass prevention via screening at a 
regional scale. 

Keywords : Voice Analysis, Otorhinolaryngology 


I. INTRODUCTION 


Voice is the result of a complex mechanism involving 
different organs of the pneumophonoarticulatory 
apparatus. In particular, it is the result of the vibration of 
the upper part of the mucosa covering the vocal cords. 
Such vibration produces a sound, the 
laryngeal fundamental tone, that is enriched by a set of 
harmonics generated by the resonance cavities in the 
upper part of the larynx. Any modification of this system 
may cause a qualitative and/or quantitative alteration of 
the voice, defined as dysphonia. Dysphonia can be due to 
both organic factors (organic dysphonia) and other factors 
(dysfunctional dysphonia). 

Dysphonia is one of the major symptoms of benign 
laryngeal diseases, such as polyps or nodules, but it is 
often also the first symptom of neoplastic diseases such as 
laryngeal cancer. Spectral "noise" is strictly 
linked to air flow turbulence in the vocal tract, mainly 
due to irregular vocal fold vibration and/or closure, 
causing dysphonia. Such a symptom requires a set of 
endoscopic analyses (using a videolaryngoscope, VLS) 
for accurate assessment. 


However, clinical experience has shown that 
dysphonia is often underestimated by patients, and 
sometimes even by family doctors. As widely reported in 
the literature [1, 2], an early detected glottis tumour (stage 
T1, T2) can be resolved in 100% of cases with surgical 
intervention. Thus, the screening of voice alterations is 
extremely important in larynx diseases. 

Several algorithmic approaches 
to the automatic analysis of voice signals exist. Software tools 
(commercial and freely available) allow voice 
components to be manipulated efficiently (e.g. WinPitch¹, 
VOICEBOX²) and permit specialists to manipulate and 
analyze voice signals. Many automatic systems are based 
on voice signal processing, whereas others combine signal 
processing with machine learning and data mining 
algorithms. The problem is that most of them are usable 
only locally, and none of them offers remote collection 
and analysis as well as storage in central databases for 
further use. The system described in [3] is one of the few 
remote data analysis systems. Its drawback is that the voice is 
acquired over a standard telephone channel, whose 
low signal quality is known to degrade the quality of 
classification. 

However, to our knowledge, no remote 
screening system is available that allows a database 
of voice signals to be set up while giving disabled patients 
a simple test for voice screening, without the need to 
travel to the laboratory. 

The paper presents the architecture and the first 
implementation of REVA (Remote Voice Analysis), a 
web-based system for the acquisition and automatic 
analysis of vocal signals. The system consists of a client 
module in which a user, after registration and a 
verification of the minimum hardware requirements, is guided 
through a test phase in which the voice signal is recorded. The 
voice signal, cleaned of noise, is sent through the 
Internet to the remote server, which is in charge of 
analyzing it; the server returns the signal 
analysis results to the client, and possible voice anomalies are 
related to potential diseases. After testing in the 
University of Catanzaro Hospital, the system will be 
finalized for diagnostics in the otorhinolaryngologist 
setting, in particular to carry out mass prevention via 
screening at a regional/national scale. 


1 http://www.winpitch.com/ 
2 http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html 

The rest of the paper is organized as follows. Section 2 
describes the system architecture. Section 3 presents the 
first prototype implementation. Section 4 points out the 
benefits, and Section 5 concludes the paper and 
sketches future work. 


II. SYSTEM ARCHITECTURE 


The REVA system employs a client/server architecture 
deployed as a web based application. 


Fig. 1 REVA Architecture 


The main modules of the system are shown in Fig. 1: 

1. The Presentation Module represents the web 
interface between the system and the user. It is 
used to allow interaction with both the final users 
and the doctors. It contains the system description 
and the disease description. Its main tasks consist 
of giving the instructions for the system use and 
returning the analysis result to the user. Moreover, 
a specialized interface for the doctors is also 
provided. 

2. The Data Acquisition Module is in charge of 
managing user personal data. After data is 
collected, the user is guided to the voice recording 
phase. 

3. The Vocal Data Registration Module acquires the 
vocal samples and, after checking whether they 
are suitable for the analysis, sends the audio files 
to the server. 

4. The Vocal Data Analysis Module, after the audio 
file has been received, extracts key signal 
parameters and performs the analysis for 
classification. It returns the result to the 
Administrator module. 

5. The Data Administrator Module saves the data in 
the database and generates the response for the 
Presentation Module. The response is also sent by 
e-mail. 

6. The Database Module contains data acquired 
through client interface. Voice signals are stored 
both in the raw data format as well as in a 
preprocessed format, where the main parameters 
related to the signal are stored. 


III. PROTOTYPE 


The REVA system has been implemented using 
Java technology. In particular, the client is implemented 
as a Java Applet, while the server functions have 
been implemented using Java Server Pages. The 
database is implemented using the open-source 
relational DBMS MySQL. 

In the following a brief description of the system 
functionalities is provided, both from the client and server 
sides. 


A. Client module 


The minimal system requirements for the remote user 
consist of a PC with an Internet connection, a web browser, 
an audio card and a microphone. The user visits the web site 
where the remote diagnostic service is available. The 
main page of the site provides a detailed description 
of the service offered, its scope and its 
effectiveness. 

To take a test, the user accesses the registration page, 
entering his/her data and other information useful for the 
diagnosis. When the registration phase is completed, the 
user enters the testing phase and accesses a new page (see 
Fig. 2), which guides him/her through the acquisition of 
vocal samples. 


Fig. 2 Patient view: voice recording. 


The file containing the audio recording is analysed 
on the client in a preprocessing phase (for instance, to 
exclude empty, inconsistent or too long files, or to reduce 
noise). If the recording is validated, the audio file is 
sent to the server through the Internet. The result of the 
signal analysis is transmitted to the user both via a new 
webpage and via e-mail. 
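
The kind of client-side check described here can be sketched with Python's standard `wave` module. The sketch is written in Python purely for illustration (the REVA client itself is a Java applet), and the function name and the duration limits of 1 s and 30 s are assumptions, since the paper does not give the actual validation criteria.

```python
import wave


def validate_recording(path, min_seconds=1.0, max_seconds=30.0):
    """Return True if the WAVE file is readable, non-empty and not too long.

    min_seconds / max_seconds are illustrative limits, not taken from the paper.
    """
    try:
        with wave.open(path, "rb") as wav:
            n_frames = wav.getnframes()
            rate = wav.getframerate()
    except (wave.Error, EOFError, OSError):
        # Missing, truncated or non-WAVE files are treated as inconsistent
        return False
    if rate <= 0 or n_frames == 0:
        return False
    duration = n_frames / rate
    return min_seconds <= duration <= max_seconds
```

Only files that pass such a check would be worth sending to the server; noise reduction, also mentioned in the text, is not covered by this sketch.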

Note that the patient's personal (name, surname, etc.) and 
clinical data are collected into an XML document that is 
sent to the server together with the audio file. Metadata 
are stored with the audio files in the database so that, for 
patients who access the service periodically, the 
medical specialist and the system are able to monitor and 
analyze the voice signals over time. 
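
As an illustration of such a metadata document, the following Python sketch builds a small XML record with the standard `xml.etree.ElementTree` module. The element and attribute names (`patient`, `personal`, `clinical`, `audio`, `file`) are invented for the example; the paper does not describe the actual REVA schema.

```python
import xml.etree.ElementTree as ET


def build_metadata(name, surname, clinical_notes, audio_filename):
    """Serialize patient metadata to an XML string (hypothetical schema)."""
    root = ET.Element("patient")

    personal = ET.SubElement(root, "personal")
    ET.SubElement(personal, "name").text = name
    ET.SubElement(personal, "surname").text = surname

    ET.SubElement(root, "clinical").text = clinical_notes

    # Reference to the audio file that travels with this document
    ET.SubElement(root, "audio", {"file": audio_filename})

    return ET.tostring(root, encoding="unicode")
```

Keeping the reference to the audio file inside the document lets the server associate each recording with the right patient record when both arrive together.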
B. Server module 


The server hosts a listener process waiting for
connections from remote users. It receives from the
client both an XML file, containing the metadata of the
user requesting the service, and the related audio files (in
WAVE format) obtained from the recording. Vocal files
and metadata are archived in the database.

The server performs a preliminary processing of the
signal (preprocessing) to extract from the audio sample
the information needed for classification. The
classification procedure is then run on the vocal signal,
using parameters defined in a preliminary phase together
with the doctors, based on their experience and on a
statistical study of the available samples (see the next
subsection). For a returning user, a comparison with the
previously recorded samples is foreseen, to evaluate the
temporal evolution of the user's voice.

On the server side, a separate web interface allows
doctors and specialists to analyze the stored voice
samples. Data coming from user submissions are
automatically stored in the database, where a simple
Electronic Patient Record (EPR) holds voice samples,
signal parameters, metadata and information about
patients. Using this interface the doctor can:

• view the most recently entered voice samples requiring
attention;

• load, listen to and compare voice signals (e.g. for
patients who have undergone surgery);

• analyze them with the implemented voice analysis
module (see Fig. 3).


C. Vocal signals analysis techniques 


The classification of the vocal samples requires
suitable processing, to extract from the audio
recording a set of significant parameters. For this
purpose, computations are usually performed by mapping
the signal data into the frequency domain [4].

The main parameters of clinical interest, considered for
the evaluation, are:

• Fundamental frequency tracking (linked to laryngeal
and vocal fold pathologies), as well as irregularities in
vocal fold oscillation (jitter and shimmer).

• Measures of dysphonia (voice quality indexes based
on "noise" estimation, the noise being caused by
irregularities and pathologies that produce turbulence in
the air flow from the glottis).

Pitch estimation is performed via two approaches,
which use the Average Magnitude Difference Function
(AMDF) and Simple Inverse Filter Tracking (SIFT),
respectively [5, 6].

In the first approach the estimate of the fundamental
frequency ($f_0$) is obtained by filtering the signal
with a suitable Continuous Wavelet Transform (CWT) and
extracting its time periodicity by means of the AMDF
method [7].


Fig. 3 Doctor view: spectral analysis of voice samples. 


Given a signal frame of length $M$, $\{x(k)\}$, $k = 0,\dots,M-1$,
the AMDF is defined as:

$$\mathrm{AMDF}(n) = \frac{1}{M-n} \sum_{k=n}^{M-1} \left| x(k) - x(k-n) \right|, \qquad n = 0,\dots,M-1$$

The scale factor $(M-n)^{-1}$ eliminates the decreasing
trend of the AMDF, due to the truncated sum. For
noisy signals, the AMDF minimum is usually greater than
zero. Hence, in order to recover $f_0$, one has to select the
lag value $\tau$ that gives the minimum of the AMDF
($\tau = F_s / f_0$, where $F_s$ is the sampling frequency [7]).


The method is appealing because of its low computational
burden, but it is more sensitive to noise than other
approaches.
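As a rough illustration of the AMDF step, the sketch below computes a length-normalised AMDF on one frame and picks the lag of its minimum. It is a minimal Python example under stated assumptions: the CWT pre-filtering is omitted, and the 60 to 250 Hz search range is borrowed from the SIFT description rather than stated for this method; the function names are illustrative only.

```python
import numpy as np


def amdf(frame: np.ndarray) -> np.ndarray:
    """Normalised AMDF: mean absolute difference between the frame and its n-sample shift."""
    m = len(frame)
    out = np.empty(m)
    for n in range(m):
        # Only (m - n) terms survive the truncation; dividing by (m - n)
        # removes the decreasing trend that the truncated sum would cause.
        out[n] = np.abs(frame[n:] - frame[:m - n]).sum() / (m - n)
    return out


def estimate_f0(frame, fs, f_min=60.0, f_max=250.0):
    """Return fs / tau, where tau minimises the AMDF inside the admissible lag range."""
    d = amdf(np.asarray(frame, dtype=float))
    lo = int(fs / f_max)                      # shortest admissible period, in samples
    hi = min(int(fs / f_min), len(d) - 1)     # longest admissible period, in samples
    tau = lo + int(np.argmin(d[lo:hi + 1]))
    return fs / tau
```

Restricting the search to lags between `fs / f_max` and `fs / f_min` avoids the trivial minimum at lag zero, where the AMDF always vanishes.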

In the second approach, which relies on Linear
Prediction (LP) analysis of the data, the vocal tract is
described through an Auto Regressive (AR) model. The
following procedure is implemented on each data frame,
whose length $M$ is inversely proportional to
$F_{\mathrm{inf}}$, the lowest value in the range of
interest for $f_0$:

- estimation of the correct order p of the AR model
by means of a Singular Value Decomposition
(SVD) approach;

- computation of the AR parameters, which
determine the time-varying vocal tract inverse
filter IF, through the forward-backward algorithm
[8];

- estimation of the residual sequence by applying
the signal to the filter IF;

- band-pass filtering of the residual sequence in
the range 50 Hz to 1.5 kHz and evaluation of the
maximum of the autocorrelation sequence (AS)
of the residuals in the frequency range 60 to
250 Hz ($f_0 = F_s/\tau$, where $\tau$ is the index
corresponding to the maximum of the AS).

The computational complexity is rather high, but this 
procedure is one of the most robust and accurate. 
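The inverse-filtering idea can be sketched in a few lines of Python. Note that this is a simplified stand-in, not the procedure of the paper: the AR order is fixed rather than selected by SVD, the coefficients come from the autocorrelation (Yule-Walker) equations instead of the forward-backward algorithm, and the band-pass is a crude FFT mask. All function names and default values are assumptions made for illustration.

```python
import numpy as np


def lpc_autocorr(x, order):
    """AR coefficients a_1..a_p from the autocorrelation (Yule-Walker) equations."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])


def sift_f0(x, fs, order=10, band=(50.0, 1500.0), f_range=(60.0, 250.0)):
    """Estimate f0 from the autocorrelation peak of the band-passed LP residual."""
    x = np.asarray(x, dtype=float)
    a = lpc_autocorr(x, order)
    # Inverse filter: e(n) = x(n) - sum_k a_k x(n - k)
    residual = np.convolve(x, np.concatenate(([1.0], -a)))[:len(x)]
    # Crude band-pass: zero the FFT bins outside the band of interest
    spec = np.fft.rfft(residual)
    freqs = np.fft.rfftfreq(len(residual), d=1.0 / fs)
    spec[(freqs < band[0]) | (freqs > band[1])] = 0.0
    residual = np.fft.irfft(spec, n=len(residual))
    # Autocorrelation sequence for non-negative lags
    acs = np.correlate(residual, residual, mode="full")[len(residual) - 1:]
    lo, hi = int(fs / f_range[1]), int(fs / f_range[0])
    tau = lo + int(np.argmax(acs[lo:hi + 1]))
    return fs / tau


def synthetic_vowel(fs=8000, f0=100.0, n=2400, seed=0):
    """Test signal only: impulse train plus weak noise through a two-pole resonator."""
    rng = np.random.default_rng(seed)
    excitation = 0.01 * rng.standard_normal(n)
    excitation[::int(fs / f0)] += 1.0
    y = np.zeros(n)
    for i in range(n):
        y[i] = excitation[i]
        if i >= 1:
            y[i] += 1.6 * y[i - 1]
        if i >= 2:
            y[i] -= 0.81 * y[i - 2]
    return y
```

For example, `sift_f0(synthetic_vowel(), 8000)` recovers the pitch of the synthetic excitation. The cost is dominated by the autocorrelation and the linear solve, which is in line with the remark that this approach is heavier than the AMDF.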

A measure of the dysphonic component of the voice
spectrum, relative to the total signal energy, is given by
the NNE index [9]. Given the speech signal
$x(n) = s(n) + w(n)$, where $s(n)$ is the periodic component
and $w(n)$ is the additive noise component, let $X(k)$, $S(k)$
and $W(k)$ be the discrete Fourier transforms (DFT) of $x(n)$,
$s(n)$ and $w(n)$, respectively. The adaptive NNE (ANNE)
is defined as

$$\mathrm{ANNE} = 10 \log_{10} \frac{\sum_{m=1}^{L} \sum_{k=N_L}^{N_H} |\hat{W}_m(k)|^2}{\sum_{m=1}^{L} \sum_{k=N_L}^{N_H} |X_m(k)|^2}$$

where $N_L = [N f_L T]$, $N_H = [N f_H T]$, $N$ is the number of
DFT points, $L$ is the number of frames in the analysis
interval, and $f_L$ and $f_H$ are respectively the lowest and the
highest frequencies of the frequency band of interest.
$|\hat{W}_m(k)|^2$ is an estimate of the unknown noise energy,
$|X_m(k)|^2$ is the signal energy and $T$ is the sampling
period. ANNE is non-positive: the closer its value is to
zero, the larger the noise energy on that signal frame,
i.e. the noisier the signal.
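To make the role of the band limits concrete, the short Python sketch below evaluates a noise-to-total energy ratio in dB over the DFT bins between $[N f_L T]$ and $[N f_H T]$, following the usual NNE form of [9]. How the noise spectrum is estimated is not specified here, so the function simply takes it as an input; the function name and argument layout are assumptions.

```python
import numpy as np


def nne_db(noise_spec, signal_spec, fs, f_low, f_high):
    """Noise energy over signal energy, in dB, within [f_low, f_high].

    noise_spec and signal_spec have shape (L, N): one N-point DFT per frame.
    """
    noise_spec = np.atleast_2d(noise_spec)
    signal_spec = np.atleast_2d(signal_spec)
    n_points = signal_spec.shape[1]
    t = 1.0 / fs                                   # sampling period T
    n_l = int(np.floor(n_points * f_low * t))      # first bin of the band
    n_h = int(np.floor(n_points * f_high * t))     # last bin of the band
    band = slice(n_l, n_h + 1)
    noise_energy = np.sum(np.abs(noise_spec[:, band]) ** 2)
    signal_energy = np.sum(np.abs(signal_spec[:, band]) ** 2)
    return 10.0 * np.log10(noise_energy / signal_energy)
```

With a noise magnitude ten times smaller than the signal magnitude in every bin, the function returns -20 dB; as the noise estimate approaches the signal itself, the value rises toward 0 dB.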

The voice signal analysis was implemented in
Matlab.


IV. DISCUSSION


The main goal of this work is the realization
of a web-based system for the acquisition and automatic
analysis of vocal signals. It is important to remark that the
proposed instrument is meant neither to replace the
specialist nor to provide a diagnosis; rather, it
aims to flag the potential presence of
pathologies of the larynx or the vocal tract, and to advise
potentially affected patients to see a specialist for an
accurate voice examination.

Detecting voice alterations simply and rapidly for a
potentially very large number of users is one of the main
requirements of the system. This requirement is suggested
by clinical experience: patients with voice
anomalies often delay specialist examinations, which in
most cases limits the effectiveness of treatment. Thus, the
idea behind the system arises from the need to educate
patients in self-diagnosis by means of a simple, remotely
accessible and user-friendly system.

The system will be made completely and freely
accessible from a web portal (after a free registration).
This solution offers several advantages:

• elimination of the discomfort due to time and/or
distance constraints, which often induce the patient to
postpone the specialist's visit indefinitely;

• removal of a possible psychological block in
the presence of the doctor (due, for example, to the fear
of a possible endoscopic investigation);

• since the system can be freely accessed on the
Internet, even less wealthy patients may use it, also
when the suspicion of a pathology is very slight.

The use of the client-server system allows a
diagnostic analysis from the client (patient) side and, at
the same time, allows populating a national-scale
database containing several types of vocal anomalies.


V. CONCLUSION 


The paper presented a web-based system for the
remote acquisition and automatic analysis of vocal
signals. Vocal signals are submitted by the users through
a simple web interface and are analyzed in real time
using state-of-the-art signal processing techniques,
providing first-level information on possible voice
alterations.

Future work will concern testing the system in the
Department of Otorhinolaryngology of our
University, for full clinical validation and for post-surgery
monitoring, i.e. for checking the status of patients after
surgical intervention and during follow-up.


Abstract: Techniques for the visualization of
high-dimensional data are common in exploratory data
analysis and can be very useful for gaining an
intuition into the structure of a data set. The classical
method of principal component analysis is the one
most often employed; however, in recent years a
number of nonlinear techniques have been
introduced. In the present paper, principal component
analysis and two newer methods are applied to a set
of speech data and their results are compared.
Keywords: PCA, LLE, Kernel PCA


I. INTRODUCTION 


Techniques which transform a high-dimensional space
into a space of fewer dimensions, often one, two or
three, are collectively known as
dimensionality reduction techniques. They can be very
useful in helping us visualize data sets which we are
trying to analyze, often providing clues about properties
of the data, such as possible clusters within the data.

The most commonly used classical method for
dimensionality reduction is perhaps principal component
analysis (PCA), also known as the Karhunen-Loève
transform and commonly computed via singular value
decomposition [1]. PCA performs a linear mapping of the
data to a lower-dimensional space in such a way that the
variance of the data in the low-dimensional
representation is maximized.
A disadvantage of PCA is that the embedded subspace
has to be linear. For example, if the data are located on a
circle in a 3-dimensional Euclidean space, $\mathbb{R}^3$, PCA will
not be able to identify this structure. Another
disadvantage is that PCA depends critically on the units
in which the features are measured.

In recent years, a number of other visualization
techniques have become available, and their application
to data sets, such as those involving speech, is only
beginning [2,3,4]. Among these methods, kernel
PCA (KPCA) [5] and local linear embedding (LLE) [6,7]
are particularly relevant for our purposes in the present
paper.

KPCA is a (usually) nonlinear extension of PCA using
kernel methods. Kernel methods have been successfully
applied in the fields of pattern analysis and pattern
recognition [8], often providing better classification
performance than other methods, and frequently playing a
vital part in the nonlinear extension of classical
algorithms. LLE, on the other hand, provides
low-dimensional, neighborhood-preserving embeddings. This
means that points which are 'close' to one another in the
data space will also be close when projected onto the
low-dimensional space.

In this paper, our aim is to briefly describe these
methods, and then to apply and compare them on a set of
normal and pathological speech data.


II. METHODS


PCA is an unsupervised learning algorithm that 
attempts to efficiently represent the data by finding 
orthonormal axes which maximally decorrelate the data. 
The data is then projected onto these orthogonal axes. 
The principal components are precisely this set of q 
orthonormal vectors, where q is often 2 or 3. 

There are several equivalent ways to find the principal
components, one being that of finding the first q
eigenvectors w of the covariance matrix C of the data set,
corresponding to the q largest eigenvalues.
Mathematically, if $(x_1,\dots,x_N)$ is a zero-mean data set from
the Euclidean space $\mathbb{R}^m$, then the covariance matrix is
given by:

$$C = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^{T} \qquad (1)$$

and the corresponding eigenvalue equation is

$$Cw = \lambda w \qquad (2)$$
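The eigenvector route to PCA can be sketched directly with NumPy. The example below is a minimal illustration, assuming the rows of `X` are the data vectors; the function name `pca_project` is introduced here for convenience.

```python
import numpy as np


def pca_project(X, q):
    """Project the rows of X onto the q leading eigenvectors of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # enforce the zero-mean assumption
    C = Xc.T @ Xc / len(Xc)                  # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:q]    # indices of the q largest eigenvalues
    W = eigvecs[:, order]                    # principal components as columns
    return Xc @ W, eigvals[order]
```

Because the features enter the covariance matrix with their original units, rescaling one column of `X` changes the components; standardising the columns first is one common way around this.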


PCA provides a linear mapping of the data onto the
lower q-dimensional space, and suffers from several
problems, some of which have been mentioned in the
introduction. In order to define a nonlinear extension of
PCA, KPCA has been introduced. KPCA uses the notion
of a kernel to modify the corresponding algorithm.
Generally, if X is a data set, then a (positive-definite)
kernel k on X×X is defined as a real-valued function:

$$k : X \times X \to \mathbb{R} \qquad (3)$$

such that:

(i) k is symmetric: $k(x,y) = k(y,x)$ $\forall\, x,y \in X$, and
(ii) k is positive definite: $\forall\, N \ge 1$

$$\sum_{i,j=1}^{N} a_i a_j k(x_i,x_j) \ge 0 \qquad (4)$$

$\forall\, a_1,\dots,a_N \in \mathbb{R}$ and $x_1,\dots,x_N \in X$.

It can be shown that given a kernel k, there exists a
(Reproducing Kernel) Hilbert space H and a
transformation $\varphi: X \to H$ such that

$$k(x,y) = \langle \varphi(x), \varphi(y) \rangle \qquad (5)$$

holds. H is often referred to as feature space and is often
infinite-dimensional.

The most commonly used kernels are the polynomial
and radial basis function kernels defined on $\mathbb{R}^m \times \mathbb{R}^m$ by:

$$k(x,y) = (\langle x,y \rangle + 1)^d \qquad (5)$$

$$k(x,y) = \exp(-\|x-y\|^2 / 2\sigma^2) \qquad (6)$$

respectively, where $d = 1,2,\dots$ and $\sigma \in \mathbb{R}$. For these
kernels the transformation $\varphi$ is not defined explicitly, and
the kernels are applied directly in the original data space.
This is known as the 'kernel trick'.

For the kernels of Eq (5) and (6), it can be shown that
KPCA is conceptually the same as performing standard
PCA with the data set $\{\varphi(x_1),\dots,\varphi(x_N)\}$ in the feature
space H (with the above notation). Fortunately, the kernel
trick, referred to above, can also be applied in this case
and the explicit use of $\varphi$ avoided. Instead, the N×N kernel
matrix K is defined through $K_{ij} = k(x_i,x_j)$, and the
equation:

$$K\alpha = \lambda\alpha \qquad (7)$$

is solved for $\lambda \in \mathbb{R}$ and $\alpha = (\alpha_1,\dots,\alpha_N)^T \in \mathbb{R}^N$.
A projection p of a pattern y in data space onto a
principal component in feature space can be found using:

$$p = \sum_{i=1}^{N} \alpha_i k(y,x_i) \qquad (8)$$

In order to use KPCA, we have to decide on a kernel 
function and, as for PCA, the number of dimensions on 
which to project. 
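A compact NumPy sketch of KPCA with a radial basis function kernel is given below. It follows the kernel-matrix eigenproblem and the projection formula described above, with two additions that are assumptions of this sketch rather than statements from the text: the kernel matrix is centred in feature space, and each eigenvector is rescaled so that the corresponding feature-space component has unit norm. Function names are illustrative.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """Radial basis function kernel between the rows of A and the rows of B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def kpca_fit(X, q, sigma=1.0):
    """Solve the kernel eigenproblem and keep the q leading components."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    K = rbf_kernel(X, X, sigma)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one      # centre the data in feature space
    lam, alpha = np.linalg.eigh(Kc)                 # ascending eigenvalues
    idx = np.argsort(lam)[::-1][:q]
    lam, alpha = lam[idx], alpha[:, idx]
    alpha = alpha / np.sqrt(lam)                    # unit-norm components in feature space
    return {"X": X, "K": K, "alpha": alpha, "sigma": sigma}


def kpca_project(model, Y):
    """Project new patterns: p = sum_i alpha_i k(y, x_i), with a centred kernel."""
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    K, alpha = model["K"], model["alpha"]
    k = rbf_kernel(Y, model["X"], model["sigma"])
    kc = k - k.mean(axis=1, keepdims=True) - K.mean(axis=0)[None, :] + K.mean()
    return kc @ alpha
```

Calling `kpca_project(model, X)` on the training data itself gives the low-dimensional coordinates used for plotting; the only choices left to the user are the kernel (here the width `sigma`) and the number of components `q`.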

LLE is an unsupervised learning algorithm that
computes low-dimensional, neighborhood-preserving
embeddings of high-dimensional inputs. LLE does this in
three steps. First, for each point in the data, its
k nearest neighbors among the other points in the data are
found (usually using Euclidean distance, although in the
present paper other distance metrics are also tried). Then,
each point is approximated by a convex combination of
its k nearest neighbors, to obtain a matrix of
reconstruction weights W. Finally, low-dimensional
embeddings $y_i$ (usually in a space of one or two
dimensions) are found such that the local convex
representations are preserved. Mathematically, this
process can be expressed as follows. If $\{x_1,\dots,x_N\}$ is the
data set, and for each vector $x_i$ we let $N_i$ denote the
indices of its k nearest neighbors, then the second step, of
finding the reconstruction weights W, corresponds to
minimizing the objective function:

$$E(W) = \sum_{i} \Big\| x_i - \sum_{j \in N_i} W_{ij} x_j \Big\|^2 \qquad (9)$$

subject to $\sum_{j} W_{ij} = 1$.

The embeddings $\{y_1,\dots,y_N\}$ of the original data,
corresponding to the third step, are obtained by
minimizing the following objective function:

$$\Phi(Y) = \sum_{i} \Big\| y_i - \sum_{j \in N_i} W_{ij} y_j \Big\|^2 \qquad (10)$$


An advantage of LLE is that it has few free parameters 
to set and a non-iterative solution thus avoiding 
convergence to a local minimum. 
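The three LLE steps can be written out in NumPy as follows. This is a minimal sketch under stated assumptions: Euclidean distance is used, the local Gram matrix is regularised (the parameter `reg` is an addition of this sketch, needed when k exceeds the data dimension), and the weights are only constrained to sum to one, so they are not forced to be non-negative.

```python
import numpy as np


def lle(X, k, q, reg=1e-3):
    """Return (embedding, weights) for a basic locally linear embedding."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Step 1: k nearest neighbours of each point (the point itself excluded)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]
        # Step 2: weights minimising the reconstruction error, summing to one
        Z = X[nbrs] - X[i]
        G = Z @ Z.T
        G += reg * np.trace(G) * np.eye(k)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()
    # Step 3: embedding from the bottom eigenvectors of (I - W)^T (I - W)
    I_W = np.eye(n) - W
    vals, vecs = np.linalg.eigh(I_W.T @ I_W)
    return vecs[:, 1:q + 1], W          # skip the first (near-constant) eigenvector
```

The embedding comes from a single eigendecomposition, which is the non-iterative solution mentioned above; the free parameters are the neighbourhood size `k` and the target dimension `q`.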

Interesting relationships have recently been found
between KPCA and LLE, as well as other well-known
dimensionality reduction techniques (cf. [10]).


III. DATA


In the present paper, the data consist of real
voice samples of the sustained vowel 'ah' from both
normal subjects and subjects with dysphonic speech
disorders. The voice samples were taken from the
"Disordered Voice Database" [11], acquired at the
Massachusetts Eye and Ear Infirmary Voice and Speech
Laboratory and distributed by Kay Elemetrics. The
clinical information includes diagnostic information
along with patient identification, age, sex, smoking
status, and more. The files on normal subjects were
collected at Kay.

The eight variables used in the paper are the same as 
those chosen in [12], namely: degree of voice breaks, 
three variables related to jitter (local, relative average 
perturbation, five-point period perturbation quotient), 
three related to shimmer (local, three-point amplitude 
perturbation, eleven-point amplitude perturbation), and 
harmonics-to-noise ratio. 

For completeness, we include their definitions (cf.
[12] for more details):

1) Degree of voice breaks is the total duration of the
breaks between the voiced parts of the signal, divided by
the total duration of the analyzed part of the signal. Silences
at the beginning and at the end of the signal are not
considered breaks.

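2) Jitter or period perturbation quotient

a) Jitter ratio (local) or jitt is defined as:

$$\mathrm{jitt} = 1000 \cdot \frac{\frac{1}{n-1} \sum_{i=1}^{n-1} |P_i - P_{i+1}|}{\frac{1}{n} \sum_{i=1}^{n} P_i} \qquad (11)$$

where $P_i$ is the period of the $i$-th cycle, in ms, and n is the
number of periods in the sample.

b) Relative average perturbation (RAP):

$$\mathrm{RAP} = \frac{\frac{1}{n-2} \sum_{i=2}^{n-1} \left| \frac{P_{i-1} + P_i + P_{i+1}}{3} - P_i \right|}{\frac{1}{n} \sum_{i=1}^{n} P_i} \qquad (12)$$

c) Five-point period perturbation quotient (ppq5):

$$\mathrm{ppq5} = \frac{\frac{1}{n-4} \sum_{i=3}^{n-2} \left| \frac{1}{5} \sum_{j=i-2}^{i+2} P_j - P_i \right|}{\frac{1}{n} \sum_{i=1}^{n} P_i} \qquad (13)$$
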
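3) Shimmer or amplitude perturbation quotient

a) Shimmer (shimm):

$$\mathrm{shimm} = \frac{\frac{1}{n-1} \sum_{i=1}^{n-1} |A_i - A_{i+1}|}{\frac{1}{n} \sum_{i=1}^{n} A_i} \qquad (14)$$

where $A_i$ is the amplitude of the $i$-th cycle, and n is the
number of periods in the sample.

b) Three-point amplitude perturbation quotient (apq3):

$$\mathrm{apq3} = \frac{\frac{1}{n-2} \sum_{i=2}^{n-1} \left| \frac{A_{i-1} + A_i + A_{i+1}}{3} - A_i \right|}{\frac{1}{n} \sum_{i=1}^{n} A_i} \qquad (15)$$

c) Eleven-point amplitude perturbation quotient (apq11):

$$\mathrm{apq11} = \frac{\frac{1}{n-10} \sum_{i=6}^{n-5} \left| \frac{1}{11} \sum_{j=i-5}^{i+5} A_j - A_i \right|}{\frac{1}{n} \sum_{i=1}^{n} A_i} \qquad (16)$$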


4) Harmonics-to-noise ratio: This parameter quantifies
the amount of glottal noise in the vowel waveform. In
contrast to perturbation measures, it attempts to resolve
the vowel waveform into signal and noise components,
and computes the ratio of their energies.
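The perturbation measures share one pattern: a local measure is the mean absolute difference between consecutive cycles divided by the mean cycle value, and a k-point quotient is the mean absolute deviation of each cycle from the average of the k-cycle window centred on it, again divided by the mean. The Python sketch below implements that pattern, assuming the cycle periods and amplitudes have already been extracted; the function names are illustrative, and the factor 1000 for the jitter ratio follows the definition of jitt.

```python
import numpy as np


def local_perturbation(v):
    """Mean absolute cycle-to-cycle difference, relative to the mean value."""
    v = np.asarray(v, dtype=float)
    return float(np.mean(np.abs(np.diff(v))) / np.mean(v))


def perturbation_quotient(v, points):
    """Mean absolute deviation from a centred moving average over `points` cycles (odd)."""
    v = np.asarray(v, dtype=float)
    half = points // 2
    window_mean = np.convolve(v, np.ones(points) / points, mode="valid")
    centre = v[half:len(v) - half]       # cycles that have a full window around them
    return float(np.mean(np.abs(centre - window_mean)) / np.mean(v))


def jitter_measures(periods_ms):
    return {
        "jitt": 1000.0 * local_perturbation(periods_ms),
        "rap": perturbation_quotient(periods_ms, 3),
        "ppq5": perturbation_quotient(periods_ms, 5),
    }


def shimmer_measures(amplitudes):
    return {
        "shimm": local_perturbation(amplitudes),
        "apq3": perturbation_quotient(amplitudes, 3),
        "apq11": perturbation_quotient(amplitudes, 11),
    }
```

A perfectly regular voice gives zero for every measure, while periods alternating between 10.0 ms and 10.2 ms give a local perturbation of 0.2/10.1, i.e. about 2%.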


In total there were 34 subjects with dysphonic speech
disorders, and a further 53 normal subjects. Each
subject was associated with an 8-variable vector. The
minimum, maximum and standard deviation of each of
the eight variables are given in Table 1.


Table 1. Minimum, maximum and standard deviation of each of
the 8 variables for the normal and pathological data


Normal

Min | 0.105 | 0.048 | 0.070 | 0.064 | 0.375 | 0.567 | 0 | 17.52
Max | 0.682 | 0.368 | 0.447 | 0.463 | 3.011 | 3.770 | 0 | 30.37
Std | 0.11 | 0.069 | 0.067 | 0.088 | 0.589 | 0.744 | 0 | 2.941

Pathological

Min | 0.131 | 0.064 | 0.074 | 0.119 | 0.654 | 0.937 | 0 | 2.515
Max | 6.061 | 3.701 | 4.783 | 1.756 | 10.80 | 16.63 | 0.164 | 28.04
Std | 1.4233 | 0.8221 | 1.0783 | 0.431 | 2.524 | 3.304 | 0.035 | 6.83


IV. RESULTS 


PCA, KPCA, and LLE were applied to the real voice
samples described in the previous section. Software for
these techniques has been developed by [13,14]. Fig. 1
shows the three techniques applied to the data and
projected onto two dimensions. In this case, k=8 was
chosen for LLE, and a radial basis function kernel with
σ=1 for KPCA.


Fig. 1 PCA, LLE, and KPCA applied to the data with 2 dimensions.

In Fig. 2, the same parameters are used but with the
data projected onto three dimensions.


Fig. 2 PCA, LLE, and KPCA applied to the data with 3 dimensions.


In order to obtain a simple comparison between the
three methods, a k-nearest neighbor classifier was applied
to the projected data using k=1,3. For this, the data was
split randomly into training and test subsets with
sizes of 66% and 33%, respectively. The classification
results are shown in Table 2, where, in the first row, LLEn
means that k=n neighbors were used for LLE, and Gn
means that a radial basis function kernel with σ=n was
used for KPCA.
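The evaluation protocol can be sketched as a random split followed by a plain k-nearest neighbor vote. The Python code below is a minimal illustration: the seed, the rounding of the split size and the function names are assumptions of this sketch. With 87 subjects, a one-third test subset holds 29 vectors, which is consistent with the accuracies reported in Table 2 being multiples of 1/29 (for instance 20/29 ≈ 68.97%).

```python
import numpy as np


def knn_predict(train_X, train_y, test_X, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    train_X = np.asarray(train_X, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    train_y = np.asarray(train_y)
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    preds = []
    for row in train_y[nearest]:
        labels, counts = np.unique(row, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)


def split_and_score(X, y, k, train_fraction=2 / 3, seed=0):
    """Random train/test split, then classification accuracy in percent."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    train, test = idx[:n_train], idx[n_train:]
    pred = knn_predict(X[train], y[train], X[test], k)
    return 100.0 * float(np.mean(pred == y[test]))
```

Since a single random split of this size moves in steps of about 3.4 percentage points per test vector, repeating the split with several seeds would give a steadier comparison between the methods.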


Table 2. Results (classification accuracy, %) of applying k-NN to the projected data


     | PCA   | LLE3  | LLE5  | LLE8  | G0.5  | G1    | G5

Two dimensions

k=1  | 68.97 | 55.17 | 68.97 | 79.31 | 68.97 | 65.52 | 75.86
k=3  | 79.31 | 68.97 | 79.31 | 72.41 | 65.52 | 75.86 | 68.97

Three dimensions

k=1  | 79.31 | 55.17 | 68.97 | 72.41 | 65.52 | 75.86 | 75.86
k=3  | 65.52 | 65.52 | 79.31 | 72.41 | 65.52 | 75.86 | 72.41


V. CONCLUSIONS 


In the present paper, the dimensionality reduction
techniques of PCA, KPCA, and LLE were applied to
speech data from both normal and pathological subjects.
The data has been projected onto both two- and
three-dimensional Euclidean spaces, and different parameters
occurring in KPCA and LLE have been varied. The
projected data is shown in Figs. 1 and 2.

In order to obtain a simple comparison between the
three methods, a k-nearest neighbor classifier was
introduced and applied to the projected data. In Table 2 it
can be seen that LLE, along with PCA, achieves the best
classification performance. Whilst this is obviously not a
definitive result, and will depend on the data set and
parameters employed, it is encouraging and provides
motivation to continue the exploration of alternative
methods to PCA in the case of speech data.


Abstract: MDVP and Praat are computer programs
commonly used for acoustic analysis of voice in
clinical and research settings. Both programs extract a
set of acoustic parameters, many of which are defined
similarly. The purpose of the present study was to
compare the results obtained by both programs, and to
examine whether one of them distinguishes
among pathological groups better than the other.
Fifty-eight women participated in the study. Of these 
women, 28 were diagnosed with functional dysphonia 
and 30 women were diagnosed with benign mass- 
lesions (ten nodules, ten polyps and ten cysts). Voice 
samples, which consisted of six productions of the 
vowels /a/ and /i/, were analyzed using MDVP and 
Praat. Results show similar mean fundamental- 
frequency (mF0) values for both programs (P>0.05). 
However, jitter, shimmer, noise-to-harmonic ratio
(NHR) and degree of unvoiced segments (DUV) were
significantly lower using Praat, in comparison with
MDVP. High correlation coefficients were found
between the parallel pairs of acoustic parameters
extracted by the two programs. Jitter values obtained
using MDVP, for the vowel /i/, revealed a significant 
group difference between the nodule and cyst groups 
(P<0.05). This group contrast was not observed using 
Praat. Results suggest that although high correlations 
are found between values obtained by both programs, 
individual numerical values vary greatly. Therefore, 
combining results from both programs is not 
advisable. In addition, there are indications that
linearly transforming the results of one
program into those of the other might lead to erroneous
conclusions, and should be carried out with caution.
Keywords: Acoustic analysis, MDVP, Praat, Clinical 
implications. 


I. INTRODUCTION 


Acoustic analysis of voice is considered valuable for
quantifying measures of voice quality in various
experimental as well as clinical settings. The validity of
this tool has been challenged by many studies, since it is
still unclear which set of acoustic measures best represents
voice quality. Moreover, the relationship between
vibratory properties of the vocal folds and specific
acoustic measures has not yet been substantiated. While
previous studies have included various sets of acoustic
measures, the majority of these studies have examined,
among other parameters, fundamental frequency (F0),
measures of frequency perturbation (e.g., jitter), measures
of amplitude perturbation (e.g., shimmer) and various
noise indices.

Ostensibly, the values of the perturbation measures
mentioned above should not depend on the
software used to calculate them. Jitter and shimmer, for
example, are defined by simple and standardized
formulas [1]. The problem lies in the raw data on which
these calculations are based, i.e. the F0 contour, since
there is no standardized algorithm for the
calculation of F0. While different methods for calculating
F0 may yield relatively small differences in mean F0,
they can strongly influence the perturbation measures. This
introduces a difficulty for the clinical voice specialist,
because different programs could report different
values when analyzing identical voice samples. The
discrepancy between results obtained by such programs
was previously noticed and addressed by various
researchers [2,3].

In the present study, we examined the clinical results
of the analyses performed by two programs: MDVP (Kay
Elemetrics) and Praat (Boersma & Weenink). These
programs are commonly used for acoustic analysis in
clinical as well as research settings, and while MDVP is a
commercial package, Praat is distributed for free use.
Both programs compute a set of parallel
acoustic measures. Therefore, we were interested to learn
(1) whether the two programs provide similar or
different values for this set of basic acoustic measures;
and (2) whether the results obtained by one of the
programs would distinguish better between specific
pathological groups.


II. METHODS 

Participants: Fifty-eight women who were examined in
the Voice Clinic at the "Sheba" Medical Center,
Tel-Hashomer, were included in the study. All patients were
over the age of 18 and had undergone laryngeal
stroboscopy and a voice evaluation. Of these
women, 28 were diagnosed with functional dysphonia
(i.e., patients were dysphonic, with no organic finding).
Thirty women were diagnosed with benign vocal fold
mass-lesions: 10 with vocal nodules, 10 with polyps and
10 with cysts.

Recordings: Each patient was recorded individually in a
quiet room. Recordings were made using a
Sennheiser PC160 headset microphone connected
directly to a computer, at a sampling rate of 48 kHz.
Each subject was recorded producing the vowels /a/ and
/i/ six times.

Acoustic analyses: All recordings were analyzed twice:
using MDVP and using Praat. The MDVP analyses were
performed manually. During the analyses, the pitch
range was restricted, when necessary, to avoid
erroneous F0 values. The Praat analyses were performed
automatically, controlled by a specially written Matlab
program. In these analyses, F0 identification was set to a
range of 110-500 Hz, to minimize octave errors.
Although the two programs provide extensive sets of
acoustic parameters, only five parallel measures,
calculated by both programs, were included. These
measures were mean fundamental frequency (mF0),
jitter, shimmer, noise-to-harmonic ratio (NHR) and
percentage of unvoiced segments (referred to as DUV in
MDVP and as DEG in Praat).

Both programs calculate F0 using algorithms based on
the autocorrelation method [4,5]. Nevertheless, there are
differences between the two implementations, which
cause noticeable differences between the results obtained
by the two programs. The details of the implementations
are well documented, though to the best of our
knowledge, there is no comparison of their absolute
accuracy. Fig. 1 illustrates an example of the differences
between the two programs in tracking F0. In this figure,
the calculated F0 points are presented over a short
segment, for a single file that was included in this study.
MDVP evidently presents a larger spread of values than
Praat, though the overall F0 means are
similar (181.07 Hz in MDVP and 181.16 Hz in Praat). This
is further corroborated in the following section.


Fig. 1. F0 values calculated by Praat and MDVP over a
short segment


Statistical Analyses: Results for the repeated recordings
were averaged prior to the statistical analyses. Separate
Analyses-of-Variance were performed for each vowel. In
these analyses, Pathology (nodule, polyp, cyst and
functional) was treated as a main factor, and Program
(MDVP and Praat) was treated as a repeated factor. In
addition, Pearson correlation coefficients were calculated
to compare the results obtained by the two
programs.


III. RESULTS 

Table 1 presents the results of the acoustic analyses
performed by the two programs for the four pathological
groups. Results show that similar numerical values were
obtained for mF0 in the two programs. However, the
values obtained for the jitter, shimmer, NHR and DUV
measures were, in general, higher in MDVP than those
obtained using Praat. Statistical analyses revealed
significant differences between the two programs for
jitter [F(1,53)=68.84, p<0.001 and F(1,53)=49.29, p<0.001, for
/a/ and /i/ respectively], NHR [F(1,53)=336.16, p<0.001 and
F(1,53)=408.48, p<0.001, for /a/ and /i/ respectively] and
DUV [F(1,53)=26.70, p<0.001 and F(1,53)=32.88, p<0.001, for
/a/ and /i/ respectively]. For shimmer, the difference was
significant for /i/ [F(1,53)=5.11, p=0.028] but only
approached significance for /a/ [F(1,53)=3.61, p=0.063].
No significant differences were found
between the two programs for mF0 (p>0.05).


Fig. 2. Individual participants' jitter values for /a/,
calculated by MDVP versus Praat, along with linear
regression and correlation coefficient: (a) full range of
jitter values; (b) jitter values (MDVP) between 0
and 3%; (c) jitter values (MDVP) >3%.




No significant main effect was found for Pathology,
for any of the acoustic measures tested (p>0.05). A
significant Program × Pathology interaction was found
only for the jitter measure in the vowel /i/ (F(3,53)=3.88,
p=0.014). Post-hoc analysis revealed a significant group
difference between the nodule group (mean=1.67,
SD=1.40) and the cyst group (mean=3.16, SD=1.77), when
the analysis was performed using the MDVP program
(p<0.05). This group difference was not observed using
the Praat program (p>0.05).

Finally, high correlation coefficient values were
observed between the results obtained in the two
programs. Correlations for mF0 ranged from 0.963 to 0.970.
Correlations for the perturbation measures ranged
from 0.719 to 0.932. However, correlations for the
DUV measure were moderate (0.481 to 0.672). It should
be noted, though, that although high correlation
coefficients were obtained for most parameters, further
inspection of the data revealed additional information.
Fig. 2, for example, presents the correlation between the
jitter values for the vowel /a/, obtained using MDVP and
Praat. It is evident that a high correlation coefficient
value was obtained when computing the correlation over
the entire range of values. However, when the sample
was limited to stimuli with relatively lower jitter values
(0 to 3%), the correlation decreased to 0.39, even though
this range covered the majority of values. In contrast,
when voice samples with higher jitter values were
examined, the correlation coefficient was high, though
the sample size was smaller. Similar findings were
observed for all other parameters and vowels.
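The effect of restricting the range on a correlation coefficient is easy to reproduce. The sketch below, in Python, computes Pearson's r over all paired values and separately below and above a threshold on the first variable; the function names and the small data set in the usage note are illustrative assumptions, not values from the study.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


def correlation_by_range(x, y, threshold):
    """Correlation over all pairs, and over pairs with x below or above the threshold."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    low = x <= threshold
    return {
        "all": pearson_r(x, y),
        "low": pearson_r(x[low], y[low]),
        "high": pearson_r(x[~low], y[~low]),
    }
```

For instance, with `x = [1, 2, 3, 8, 9, 10]` and `y = [2, 1, 2, 8, 9, 10]` and a threshold of 3, the overall correlation is high although the two variables are uncorrelated within the low range: a few extreme pairs can dominate the coefficient, which is the pattern described for the jitter values.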


Table 1. Results of Acoustic Analyses Performed by the MDVP and Praat Programs for the Four Pathological Groups. Values are means, with standard deviations in parentheses.


Vowel | Parameter | MDVP Nodule | MDVP Polyp | MDVP Cyst | MDVP Functional | Praat Nodule | Praat Polyp | Praat Cyst | Praat Functional
/a/ | mF0 (Hz) | 197.65 (22.73) | 206.40 (36.41) | 225.36 (225.36) | 201.62 (38.51) | 198.36 (22.46) | 205.96 (36.45) | 217.61 (45.43) | 204.64 (35.95)
/a/ | Jitter (%) | 1.77 (1.68) | 2.39 (1.06) | 2.00 (1.47) | 2.01 (1.64) | 1.16 (1.37) | 0.93 (0.27) | 0.96 (0.77) | 0.85 (0.91)
/a/ | Shimmer (%) | 7.00 (9.18) | 7.76 (3.46) | 6.49 (3.20) | 5.85 (4.30) | 5.01 (5.09) | 7.58 (3.75) | 6.30 (4.37) | 5.12 (3.91)
/a/ | NHR | 0.19 (0.13) | 0.20 (0.09) | 0.15 (0.04) | 0.17 (0.11) | 0.09 (0.16) | 0.09 (0.08) | 0.07 (0.10) | 0.06 (0.10)
/a/ | DUV | 14.06 (19.46) | 18.13 (18.13) | 8.66 (8.63) | 12.70 (19.16) | 2.57 (5.49) | 0.50 (0.58) | 1.53 (2.46) | 1.25 (3.10)
/i/ | mF0 (Hz) | 211.20 (25.01) | 214.20 (36.16) | 220.80 (34.40) | 206.32 (38.68) | 211.21 (25.07) | 213.38 (34.08) | 217.64 (35.72) | 209.33 (35.51)
/i/ | Jitter (%) | 1.67 (1.40) | 2.49 (1.05) | 3.16 (1.77) | 1.94 (1.29) | 1.15 (1.44) | 1.09 (0.54) | 1.20 (0.91) | 1.20 (1.63)
/i/ | Shimmer (%) | 4.57 (4.88) | 5.88 (2.78) | 5.93 (4.07) | 4.72 (5.02) | 3.01 (3.20) | 5.92 (3.41) | 5.25 (4.84) | 3.84 (4.74)
/i/ | NHR | 0.15 (0.06) | 0.16 (0.04) | 0.17 (0.06) | 0.15 (0.08) | 0.04 (0.06) | 0.05 (0.04) | 0.04 (0.04) | 0.04 (0.07)
/i/ | DUV | 10.31 (11.84) | 10.21 (7.35) | 11.68 (13.11) | 0.93 (12.86) | 0.53 (0.94) | 1.86 (3.12) | 1.59 (2.31) | 1.43 (4.28)


IV. DISCUSSION 

The results of our study support previous findings, 
suggesting that different programs present different 
values of acoustic measures. This is attributed to 
algorithmic differences between the programs (see 
Boersma & Weenink, Praat manual). On the one hand,
our data show that in most cases, similar group 
differences (or lack of differences) were obtained in 
both programs, and strong correlations were found 
between the two programs. Furthermore, mean F0
values are also similar for the two programs. These
findings could support common use of both programs.
On the other hand, values of the perturbation and noise 
measures were notably different between the two 
programs, and under specific conditions, MDVP 
appeared to differentiate among pathological groups 
better than Praat. The latter finding suggests that 
combining results from the two programs, for clinical 
purposes, is not recommended, despite the use of the 
seemingly parallel acoustic measures. 

It is interesting to observe that the strong
correlations between the values calculated by the two
programs initially suggested that values from one
program could be linearly transformed to approximate the
values calculated by the other program. A more detailed
analysis showed this to be inaccurate. As shown in Fig.
2, examining only jitter values between 0 and 3%,
which covered the majority of the cases we studied,
revealed a far lower correlation between MDVP and
Praat values. Similar results were observed for other
measures and vowels. This further suggests that results
obtained from the two programs are not directly comparable.
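
The range-restriction effect described above can be illustrated with a minimal sketch. The values below are made up for illustration and are not data from the study; the helper names `pearson_r` and `restricted_r` are hypothetical. The sketch only shows how a correlation that is high over the full range can vanish (or change sign) when only the low-value cases are kept.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def restricted_r(x, y, lo, hi):
    """Correlation computed only on pairs whose x value lies in [lo, hi]."""
    pairs = [(a, b) for a, b in zip(x, y) if lo <= a <= hi]
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys)

# Made-up jitter values (%) for two hypothetical programs, NOT study data.
program_a = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
program_b = [2.0, 1.0, 2.0, 1.0, 10.0, 12.0]

r_full = pearson_r(program_a, program_b)
r_low = restricted_r(program_a, program_b, 0.0, 4.0)
print(f"full range r = {r_full:.2f}, restricted range r = {r_low:.2f}")
```

In this toy example the two high-value cases dominate the full-range correlation, which is why a strong overall correlation does not by itself justify a linear conversion between programs.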

Finally, based on these findings, it should be noted
that the use of reported thresholds for "normal"
voice, such as those presented by MDVP, should be
restricted to measures calculated by that specific program,
and should not be applied to analyses made with other
programs. This is especially pertinent when examining
measures that are based on cycle-to-cycle variation.


Abstract: A 3D FE model of the larynx including the
vocal folds and the arytenoid, thyroid and cricoid cartilages
was developed. The vocal fold tissue is modeled as a
three-layered material representing the epithelium,
vocal ligament and muscle. First, the frequency-modal
analysis of the model was performed for
nonlinear material characteristics and increasing
pre-stress of the vocal folds. Then the results of a
numerical simulation of the vocal fold oscillations
excited by a prescribed aerodynamic pressure
loading the surface of the tissue are presented. FE
contact elements are used for modeling the vocal
fold collisions.

Keywords: Biomechanics of human voice, parametric 
FE model of the human larynx, numerical simulation 
of the vocal folds vibration. 


I. INTRODUCTION 


The design of a model of the human vocal folds that
can reproduce some pathological situations and
voice disorders is becoming an important part of
voice research. With the aim of estimating
vocal fold tissue damage from changes in the vibration
regimes of the vocal folds, a new three-dimensional,
fully parametric finite element (FE) volume model of
the larynx was developed. The model respects the
phonation position of the vocal folds and makes it easy
to vary their geometrical configuration, the longitudinal
tension (pre-stress) and the nonlinear material properties
of the individual vocal fold tissue layers. The geometry
of, and relations between, the arytenoid, thyroid and
cricoid cartilages were derived from CT images of a
physical enlarged resin model of the human larynx from
the collections of the Anatomical Institute of the 3rd
Medical Faculty of the Charles University in Prague and
on the basis of the book [6]. This model is a copy of the
original physical model from Germany (Deutsches
Hygiene-Museum, Institut für biologisch-anatomische
Anschauungsmaterialien, Dresden).


II. METHODS 
A. FE model 
The 3D complex dynamic FE model of the human
larynx was developed by transferring the CT image data
from the DICOM format to the FE mesh. The
geometrical configuration of the cross-section of the
vocal fold was taken from Hirano [3], and three layers of
the vocal fold tissue are considered: epithelium, vocal
ligament and muscle, with different physical and
material properties (see Fig. 1). Full parameterization of
the model allows the thickness and material
properties of the individual layers to be varied.

Fig. 1 Schema of the vocal fold with three layers (figure labels: epithelium, thyroid cartilage).


The model can take into account the longitudinal
tension (pre-stress) and adduction of the vocal folds by
positioning the arytenoid and thyroid cartilages (see
Fig. 2). The initial position corresponds to the original
CT images of the physical model. The model was
created using 3D quadratic volume and shell finite
elements.


Fig. 2 FE model of the human larynx with the vocal
folds between the arytenoid and thyroid cartilages (figure labels:
arytenoid cartilage, loose tissue, thyroarytenoid muscle, epithelium,
thyroid cartilage, cricoid cartilage).


B. Material parameters 

The ligament layer consists of tissue fibers that are
oriented in the longitudinal direction z between the
arytenoid and thyroid cartilages. The stiffness of the
vocal fold tissue in this direction is substantially higher
than the stiffness in the perpendicular direction x. This
is why a plane orthotropic model was used [1]; its
matrix of elastic constants is defined in terms of the
following quantities: $E_p$ is the Young's modulus, $\mu_p$ is
the Poisson ratio and $G_p$ is the shear modulus in the
direction x perpendicular to the ligament fibers.
Analogous constants are denoted by the
index $l$ for the longitudinal direction z. The cartilages
were modeled as an isotropic material. For the loose
connective tissue between the vocal fold muscle and the
thyroid cartilage, a model of an incompressible material
was used. The material constants considered for the
tissues are summarized in Tab. 1.


Tab. 1 Considered nominal values of the material constants
of the individual tissue layers according to [2].
E = Epithelium, L = Ligament, M = Muscle, C = Cartilage,
LT = Loose connective Tissue.


                   E        L        M        C        LT
G_p [kPa]          0.526    0.868    1.052    -        -
G_l [kPa]          10       40       12       5        5
μ_p                0.9      0.9      0.9      0.47     0.4999
E_p [kPa]          2        3.3      4        30       0.12
E_l(ε) [kPa]       100      10       5        -        -
ρ [kg m^-3]        1020     1020     1020     1020     1020
μ_pl = μ_lp        0        0        0        x        -


The orthotropic properties of the three layers of the vocal
fold living tissue (epithelium, vocal ligament and
muscle) are modeled by respecting the material
nonlinearities with increasing prolongation ε of the
tissue. Nonlinear stiffness of the tissue fibers was
considered in the longitudinal direction z. The Young's
modulus as a function of the strain for all three layers is
shown in Fig. 3.

Fig. 3 Young's modulus of the epithelium, ligament and
muscle versus the strain [4].


III. RESULTS


The frequency-modal characteristics of the model 
were computed for increasing tension of the vocal folds 
and an influence of 20% changes in uncertain values of 
material characteristics of the tissues was modeled. The 
frequency-modal properties of the FE model are shown 
in Tab. 2 and Figs. 4-6. 


Tab. 2 Changes of the vocal fold eigenfrequencies with
increasing vocal fold tissue prolongation ε in the
longitudinal direction.


ε [%]    F1 [Hz]    F2 [Hz]    F3 [Hz]
5        107.40     130.50     140.41
15       137.82     154.50     163.44
25       165.68     177.66     185.75
35       193.68     201.71     209.24


Tab. 3 Participation factors for the x, y and z excitation
directions for the strain ε = 5% and the first three eigenfrequencies.


F_i [Hz]    γ_x          γ_y          γ_z
107.4       0.477E-03    0.736E-03    0.947E-08
130.5       0.375E-03    0.299E-03    0.675E-07
140.4       0.459E-03    0.282E-04    0.776E-07


Fig. 4 First eigenmode of the FE model of the right vocal
fold - F1 = 107.4 Hz.


Fig. 5 Second eigenmode of the FE model of the right
vocal fold - F2 = 130.5 Hz.


The dominant vibration direction for each eigenmode
was studied by using the participation factor $\gamma_i$, which is
a measure of the coincidence of one selected eigenmode
with the forced mode shape of vibration when the
structure is excited in a given direction:

$$\gamma_i = \max\{\varphi_i^{T} M D\}$$

where $\varphi_i$ is the eigenmode, M is the mass matrix of the
structure and D is the forced mode shape of vibration
excited in the direction x, y or z. The calculated
participation factors for the first three eigenmodes and all
three directions x, y, z are summarized in Tab. 3. The
displacements in the horizontal and vertical directions x and
y, respectively, are dominant for the first mode, for
which a rotation around the longitudinal axis z prevails.
The vibration in the horizontal direction x dominates for
the second eigenmode, while for the third eigenmode
the vibration amplitudes of the membranous part of the
vocal fold tissue prevail in the vertical y direction.
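
As a rough illustration of how a participation factor separates excitation directions, the sketch below evaluates the product φᵀ M D for a toy two-degree-of-freedom system. It assumes the common finite-element form of the factor, with D a unit displacement vector in the excitation direction. Any maximum or normalization used in the paper's own definition is not reproduced. The function name and the toy matrices are hypothetical and unrelated to the larynx model.

```python
def participation_factor(phi, mass, direction):
    """Return phi^T * M * D for one eigenvector.

    phi       -- eigenvector as a list of floats
    mass      -- mass matrix as a list of rows
    direction -- unit displacement vector D for the excitation direction
    """
    # First form the matrix-vector product M * D ...
    md = [sum(m_ij * d_j for m_ij, d_j in zip(row, direction)) for row in mass]
    # ... then the scalar product with the eigenvector.
    return sum(p * v for p, v in zip(phi, md))

# Toy example: one lumped mass of 2 kg with x and y translations.
mass_matrix = [[2.0, 0.0],
               [0.0, 2.0]]
mode_x = [1.0, 0.0]          # hypothetical eigenmode moving purely in x

gamma_x = participation_factor(mode_x, mass_matrix, [1.0, 0.0])
gamma_y = participation_factor(mode_x, mass_matrix, [0.0, 1.0])
print(gamma_x, gamma_y)
```

A mode that moves only in x has a non-zero factor for x excitation and a zero factor for y excitation, which is the kind of comparison made across the columns of Tab. 3.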


Fig.6 Third eigenmode of the FE model of the right 
vocal fold - F3=140.4 Hz. 


The motion of the vocal folds was then numerically
simulated for a prescribed intraglottal pressure loading
the vocal folds as a periodic function in the time
domain (see Fig. 7).

The pressure signal loading the vocal fold surface was
generated by the aeroelastic model [5] of the vocal folds
during self-sustained vibrations for a given subglottal
pressure and prephonatory glottal gap. Implementing
contact elements on the vocal fold surface
made it possible to model the impact stresses in the vocal fold
tissue layers during vocal fold collisions.

The vibration response of the vocal folds after loading 
the tissue by the prescribed intraglottal pressure is 
shown in Figs. 8-10. 


Fig. 7 Aerodynamic pressure loading the vocal folds in the
time and space domains along the vocal fold surface in the
vertical y direction during one period of the oscillation
cycle: fundamental frequency F0 = 100 Hz, subglottal
pressure 378.4 Pa, prephonatory glottal half-gap
g0 = 0.2 mm.


Fig. 8 Larynx displacements and vocal fold deformation
at the maximum glottis opening phase of the forced
vibrations generated by the prescribed pressure.


Fig.9 Comparison of the deformation of the ligament 
layer in the central part of the vocal folds in the 
maximum opening of the glottis and the maximum glottis 
closure during the collision phase. 


Fig. 10 Contact area at the right vocal fold during the 
vocal folds collision shown as isolines of the vocal folds 
distance. 


IV. CONCLUSIONS


The geometry of the model can be modified
easily, and optimization procedures can be applied to
find proper model parameters for tuning both the
vibration characteristics of the vocal folds
and the larynx model in general.

The computed fundamental eigenfrequencies and
mode shapes of vibration are qualitatively similar to
those of other simplified models in the literature [2,7], and the
obtained increase of the eigenfrequencies with increasing
vocal fold tension is also realistic. The considered
changes in the material properties, in the case of a 20%
reduction of the Young's modulus of the vocal fold tissue
in the longitudinal direction, were not found to be important.
The generated motion of the vocal folds seems to be
qualitatively similar to a vibration mode known from
clinical measurements.

Preliminary results show that the contact
elements on the vocal fold surface enable numerical
simulation of the vocal fold collisions and prediction
of the stresses in the vocal fold tissue due to the
impacts.


RELATING VOCAL FOLD AMPLITUDE OF VIBRATION TO SKIN
ACCELERATION LEVEL ON THE ANTERIOR NECK

Abstract: The purpose of this research was to
determine if a relationship between vocal fold 
amplitude of vibration and skin acceleration level 
could be found using regression techniques. The 
effects of accelerometer location and phonation 
frequency were examined. 

Keywords: vocal folds, vibration, acceleration


I. INTRODUCTION 


The ability to measure amplitude of vocal fold 
vibration in vivo is of major importance in the field of 
speech science. In voice dosimetry, vocal fold vibration 
in human subjects during prolonged periods of speaking 
is studied in order to determine the effects of exposure to 
self-induced tissue vibration in vocalization [1]. 
Amplitude of vibration (A) is a variable in the calculation 
of two of the dose measures, distance dose D_d and energy
dissipation dose D_e. These measures are important for
understanding vocal fatigue and recovery, especially 
among professionals who rely on their voice for their 
livelihood. 

In the current voice dosimetry study being conducted 
at the NCVS, the doses are calculated from skin 
acceleration levels (SAL) measured at the jugular notch,
on the anterior neck of the subjects. The derivation of A 
from SAL requires using a series of empirical equations 
based on previously published canine model and human 
subject data, and a calibration curve based on a lengthy 
data collection session in the laboratory with each 
dosimetry subject. 

The purpose of this research was to determine an 
equation relating SAL to A using regression techniques, 
for predicting A from SAL. Seven different sites on the 
anterior neck were investigated. Human SAL and A data 
were obtained in vivo during standard laryngeal exams 
using custom equipment and state-of-the-art imaging and 
audio recording and processing techniques. 


II. METHODS 
A. Subjects, Materials, Tasks, and Data Collection 
Two vocally healthy subjects, a male and a female,
with no known vocal fold pathologies were administered
videostroboscopic laryngeal exams with a rigid
endoscope while wearing seven miniature accelerometers
placed at various sites on the anterior neck, including at
the jugular notch and above, below and lateral to the
prominence of the thyroid cartilage. In order to obtain
quantitative measurements of A in absolute dimensions, a
two-point laser projection system was developed (Fig.
1a). The device projected two precisely spaced green
(wavelength = 532 nm) laser dots in the image frame,
from which absolute dimensions could be determined in
the laryngeal exam videos. Custom software was written
to perform a frame-by-frame extraction of the absolute
vocal fold length and glottal width at the mid-membranous
point, and A was calculated as half the
width (assuming symmetrical displacement of the vocal
fold edge from the glottal midline for stable, periodic
vibration of normal, healthy vocal folds).

A lightweight, thin latex patch was designed to hold 
six accelerometers in a 2x3 array centered about the 
thyroid prominence, so that consistent, repeatable 
acceleration measurements could be made between the 
different subjects and different trials (Fig. 1b). The 
accelerometers were of the same type used in previous 
studies of long-term voice use [2], [3]. The patch was 
held firmly in place with a Velcro™ strap, and the 
surface of each accelerometer was attached to the skin 
with a temporary surgical adhesive. A seventh 
accelerometer was attached to the skin at the jugular 
notch with surgical adhesive and a small strip of medical 
tape. Fig. 2 shows the locations of the seven accelerometers
on the anterior neck. Fig. 3 shows a schematic of the
experimental setup.

Subjects were asked to perform a series of sustained 
phonations on the vowel /i/ at a number of different 
intensity levels from soft to loud, and at three different 
pitches — comfortable, high, and falsetto. The 
accelerometer signals were amplified and digitally 
recorded to the hard drive of a data collection computer, 
at a sampling rate of 44.1 kHz. The video of the laryngeal 
exams was digitally recorded to the videostrobe host 
computer. All audio and video signals were time- 
synchronized so that SAL and A data points could be 
directly related to each other. 


B. Data Processing and Statistical Analysis 
Data were obtained from two separate trials for both
subjects, with at least one week between trials. The
subject/data sets were designated M01-1, M01-2, F01-1
and F01-2. A cyclical plot of vocal fold amplitude of
vibration A was extracted from the video signal for each 
data set, at a sampling rate equal to the video frame rate 
of 30 Hz. Strobe rate was set to Fast, yielding 1.5 glottal 
cycles per second and 20 frames per glottal cycle. 
Between 90 and 180 seconds of this cyclical 
representation was obtained for each data set. Root-mean 
square (RMS) values of A were obtained over 66.7 ms 
windows, overlapping by 33.3 ms (corresponding to a 
window size of one glottal cycle and an overlap of one- 
half cycle). The corresponding time segments of the SAL 
signals from all seven accelerometers were RMS- 
averaged with the same window duration and overlap, so 
the sequence of RMS values of A and SAL were still time- 
synchronized. 
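
The windowed RMS averaging described above can be sketched as follows. This is a minimal illustration, not the authors' processing code: the function name `sliding_rms` is hypothetical, and the conversion of the 66.7 ms window and 33.3 ms overlap into whole frames at the 30 Hz video rate is an assumption about how the windows were discretized.

```python
import math

def sliding_rms(samples, window, hop):
    """RMS over successive windows of `window` samples, advancing by `hop` samples."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    values = []
    for start in range(0, len(samples) - window + 1, hop):
        segment = samples[start:start + window]
        values.append(math.sqrt(sum(v * v for v in segment) / window))
    return values

frame_rate_hz = 30.0                                 # video frame rate from the text
window_frames = round(0.0667 * frame_rate_hz)        # 66.7 ms window -> 2 frames
overlap_frames = round(0.0333 * frame_rate_hz)       # 33.3 ms overlap -> 1 frame
hop_frames = window_frames - overlap_frames          # advance between windows

# Made-up amplitude samples (mm), for illustration only.
amplitude_mm = [3.0, 4.0, 3.0, 4.0]
rms_values = sliding_rms(amplitude_mm, window_frames, hop_frames)
print(rms_values)
```

Applying the same window duration and overlap to the accelerometer signals, converted to samples at their own sampling rate, keeps the two RMS sequences aligned in time.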

Scatter plots were generated of the time-synchronized
RMS values of A vs. SAL for each of the seven
accelerometer signals. A was calibrated to mm and SAL
was calibrated to m/s². An attempt was made to fit the data to a
simple linear regression model,

$$A_{\mathrm{predicted}} = b_1 \cdot SAL + b_0 \qquad (1)$$

where $b_1$ and $b_0$ are the regression coefficients
corresponding to the slope and intercept, respectively, of
the regression line. The model was chosen based on the
observation that, for sinusoidal vibration, the relation
between displacement x and acceleration a is given by

$$a_{\mathrm{RMS}} = \omega^{2} \, x_{\mathrm{RMS}} \qquad (2)$$

where ω is the radian frequency of vibration. Statistical
methods were employed to determine if there were
significant differences between the fits obtained at the
seven different locations, i.e., whether measuring the
acceleration at different locations made any difference in
the resulting fits; and if so, to determine which location
showed the highest correlation to the vocal fold
amplitude of vibration extracted from the video signal.
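
A minimal sketch of the simple linear regression and of the sinusoidal relation between acceleration and displacement is given below. The function names are hypothetical and the numbers are made up; the least-squares formulas are the standard closed-form ones and are not taken from the authors' analysis software.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares fit y ~ b1 * x + b0; returns (b1, b0, r)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx                      # slope
    b0 = my - b1 * mx                   # intercept
    r = sxy / math.sqrt(sxx * syy)      # correlation coefficient
    return b1, b0, r

def displacement_rms_from_acceleration(a_rms, f0_hz):
    """For a pure sinusoid, x_RMS = a_RMS / omega^2 with omega = 2*pi*f0."""
    omega = 2.0 * math.pi * f0_hz
    return a_rms / omega ** 2

# Made-up values: SAL in m/s^2 and amplitude A in mm (illustration only).
sal = [1.0, 2.0, 3.0, 4.0]
amp = [0.3, 0.5, 0.7, 0.9]
b1, b0, r = fit_line(sal, amp)
print(f"A_predicted = {b1:.3f} * SAL + {b0:.3f} (r = {r:.3f})")
```

Because the factor ω² depends on the phonation frequency, a single slope fitted across very different pitches would not be expected to match the sinusoidal relation, which is one motivation for fitting frequency groups separately.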


III. RESULTS


In plotting the RMS values of A vs. SAL, it was found 
that there was a clustering of the data points according to 
the fundamental frequencies of the phonations. Since 
pitch was not a variable but rather a parameter of the 
study (each subject did the same phonations at three 
different frequencies), it was decided to parameterize 
each plot of A vs. SAL by the frequency groupings Low, 
Medium and High. The linear regression fits were 
determined for each frequency group and each 
accelerometer location, as follows: 


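Model #1:

$$A_{\mathrm{Location,\,predicted}} = b_{1,\mathrm{Location,All}} \cdot SAL_{\mathrm{Location,All}} + b_{0,\mathrm{Location,All}} \qquad (3)$$

Model #2:

$$A_{\mathrm{Location,Low,\,predicted}} = b_{1,\mathrm{Location,Low}} \cdot SAL_{\mathrm{Location,Low}} + b_{0,\mathrm{Location,Low}} \qquad (4)$$

$$A_{\mathrm{Location,Med,\,predicted}} = b_{1,\mathrm{Location,Med}} \cdot SAL_{\mathrm{Location,Med}} + b_{0,\mathrm{Location,Med}} \qquad (5)$$

$$A_{\mathrm{Location,Hi,\,predicted}} = b_{1,\mathrm{Location,Hi}} \cdot SAL_{\mathrm{Location,Hi}} + b_{0,\mathrm{Location,Hi}} \qquad (6)$$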


where Model #1 is the fit for the data points of all 
frequencies combined, for a given accelerometer location, 
and Model #2 is the set of fits for the data points grouped 
according to frequency of phonation, either low, medium 
or high, for a given accelerometer location. 
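
The difference between the pooled fit (Model #1) and the per-frequency-group fits (Model #2) can be sketched for a single accelerometer location as follows. The record layout, the group labels and the function names are assumptions made for illustration, and the data are invented.

```python
from collections import defaultdict

def fit_line(x, y):
    """Least-squares slope and intercept for y ~ b1 * x + b0."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    return b1, my - b1 * mx

def fit_models(records):
    """records: (frequency_group, SAL, A) tuples for ONE accelerometer location.

    Returns a dict of (b1, b0) pairs: key "All" is the pooled fit (Model #1),
    the remaining keys are the per-group fits (Model #2).
    """
    groups = defaultdict(list)
    for group, sal, amp in records:
        groups[group].append((sal, amp))
        groups["All"].append((sal, amp))
    return {name: fit_line([p[0] for p in pts], [p[1] for p in pts])
            for name, pts in groups.items()}

# Invented data: (group, SAL in m/s^2, A in mm).
records = [
    ("Low", 1.0, 0.20), ("Low", 2.0, 0.40), ("Low", 3.0, 0.60),
    ("High", 1.0, 0.10), ("High", 2.0, 0.15), ("High", 3.0, 0.20),
]
fits = fit_models(records)
for name, (slope, intercept) in fits.items():
    print(f"{name}: b1 = {slope:.3f}, b0 = {intercept:.3f}")
```

In this toy case the pooled slope lies between the two group slopes and describes neither group well, which mirrors the clustering by fundamental frequency noted in the text.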

By fitting all subject data sets to the above models, $b_1$
coefficients (slopes) that were significantly different from
zero could be obtained for most, but not all, of the
accelerometer location/frequency group data points. The
test of non-zero slope is statistically the same as the test
that the correlation coefficient r of the regression model
is not equal to zero, i.e., that the linear regression equation
is a valid representation of the relation between SAL and
A. This was consistently the case for all subject data
sets for the low frequency data, at all accelerometer
locations. Subject/set M01-1 also had non-zero $b_1$'s for
the medium frequency data, and subject/set M01-2 had
non-zero $b_1$'s for all three frequency groups. For this
subject/set, statistical analyses showed that there were 
significant differences among the slopes of the fits for the 
three different frequencies at each accelerometer location, 
and that there were significant differences among the 
different locations. Furthermore, there was significant 
interaction between the effects of accelerometer location 
and frequency of phonation for this subject/set, in that 
there was a wider variation among the slopes of the 
different locations at low frequencies, but less variation 
among the slopes of different locations at medium and 
high frequencies of phonation. Looking only at the low 
frequency data for this subject/set, it was further found 
that certain pairings could be made, statistically, between 
the left and right counterparts of each location, which 
says that there is little difference between a left-right pair 
in the six locations around the thyroid prominence. Also, 
though there is not enough statistical evidence to 
distinguish between these six locations, taken as a group 
they are significantly different from the seventh location, 
the jugular notch. 

Visual inspection of the $b_1$ coefficients for the same
subject in the two different trials showed no consistency, 
even though the repeatability of absolute measurements 
of vocal fold amplitude of vibration with the two-point 
laser projection system and videostroboscopy had been 
shown in an earlier study [4]. 


IV. DISCUSSION 


The reason for the lack of intra-subject repeatability 
may have to do with the mechanism by which vocal fold 
vibration is transferred through tissue and measured as 
skin acceleration, and further investigation is needed. The 
lack of significant correlation between SAL and A at 
higher frequencies and in falsetto production may be due 
to the changes in vocal fold length, stiffness, and depth of 
vibration which characterize these types of phonation. 
The amplitude of vibration may not be adequately 
described by a linear model, and the “error” of the 
estimate may come not only from measurement error but 
also from the effects of unmeasured variables or omitted
predictors, such as stiffness and length. Also, a
two-dimensional measurement of horizontal amplitude of 
vibration does not describe the movement of tissue in the 
inferior-superior direction, which may contribute to the 
acceleration measured on the neck. 

For the subject/set M01-2, the signals from the seven 
accelerometer locations all provided significant 
information for predicting the vocal fold amplitude of 
vibration. A principal component analysis may allow one 
to determine the relative amount that each signal 
contributes to the prediction, and if a subset of the signals 
can provide a reasonable estimate. 


V. CONCLUSION 


The current data set shows that there may be a 
significant correlation between SAL and A at lower 
phonation frequencies, i.e., at habitual speaking pitch.
This relationship may hold if other parameters of vocal 
fold vibration, such as length and stiffness, are isolated or 
held constant. Further investigation is needed with more 
subjects and more repeated measures per subject. 

Figure 1. (a) Two-point laser device mounted on a rigid
endoscope. (b) The accelerometer array patch.

Figure 2. Accelerometer locations on anterior neck.

Figure 3. Experimental setup.


Improved fold closure in mass-spring 
low-dimensional glottal models 

Abstract: This work presents a low-dimensional physical
model of the glottis in which a 2-D fold displacement
representation makes it possible to represent both the vertical and
longitudinal displacements of the folds. We use a one-mass
mechanical model, coupled to aerodynamic driving forces,
and a delay line representation to account for the
propagation of the displacement on the body-cover. The
waveform is characterized by means of a set of acoustic
parameters (open quotient, speed quotient, return quotient,
fundamental frequency F0, etc.) that are used in the literature
as typical voice source quantification parameters. The paper
provides comparisons between the values of these parameters
computed for the proposed model and for analytical models
(LF) of the flow.

Keywords: Voice source, Low-dimensional models, Voice 
source parameters, Voice quality 


I. INTRODUCTION 


Low-complexity physical models based on the one- and
two-mass paradigm have been shown to possess desirable
properties: they are computationally efficient and stable,
they offer physically justified control of basic glottal flow
cues, and they can reproduce modal and non-modal phonation
modalities for generating a wide range of phonatory
styles and voice qualities [1], [2], [3], [4].

An open issue concerning simplified physical models
of the vocal apparatus is that they do not always
reproduce all the possible configurations and patterns of
oscillation that can be observed in actual glottal flow
waveforms. In particular, mass-spring models such as the
classic Ishizaka-Flanagan (IF) model [5] are often characterized
by unrealistic behavior in the closing phase due to
very crude representations of fold collision, and the abrupt
closure often negatively affects the perceptual result of the
synthesis. Smooth closing patterns are usually observable
in inverse-filtered glottal waveforms (see Fig. 1) and are
reproduced by many non-physical models (e.g., the
well-known Liljencrants-Fant (LF) analytical representation [6]).

This work focuses on a low-dimensional physical model
of the glottis. We use a one-mass mechanical model,
coupled to aerodynamic driving forces. We introduce a 2-D
fold displacement representation in order to
represent both the vertical and the longitudinal displacements
of the fold through delay lines that take into account
the propagation of the displacement on the body-cover.

Fig. 1. Panel a): real glottal flow waveforms obtained by inverse filtering;
panel b): glottal flow synthesis from an implementation of the IF model.
The typical abrupt closure shape is highlighted.

The paper is organized as follows. Section II gives an
overview of the voice production model under investigation
and presents the details of the proposed refinements. In
Section III, some experimental results are presented and
some properties of the new model are discussed by comparing
it with the LF analytical model. In Section IV the
conclusions are given.


II. METHOD 


The glottis model adopted here is a low-dimensional
body-cover model in which the lower edge of the folds is
represented by a single mass-spring system k, r, m and the
propagation of the displacement is represented by a delay
line of length T [1], see Fig. 2(a). The structure is a one-mass
model with a propagation line aimed at simulating
the propagation of the motion along the thickness of the
fold, in agreement with the body-cover model proposed in
[7]. A second-order resonant filter represents the oscillating
fold, and a simplified impact model reproduces the
impact distortions on the fold displacement and adds an
offset $x_0$ (the rest position of the folds).

The areas at entry and exit of the glottis can be respectively
defined as

$$a_1(t) = 2L\,(x_{01} + x_1(t)) \qquad (1)$$

$$a_2(t) = 2L\,(x_{02} + x_1(t - \tau)) = 2L\,(x_{02} + x_2(t)), \qquad (2)$$

where L is the length of the folds, $x_{01}$ and $x_{02}$ are the
rest positions of the fold at entrance and exit to the glottis,
and $\tau = T/c_f$ ($c_f$ being the wave velocity on the fold
surface) is the time taken by the wave to propagate from
the entrance to the upper end of the glottis. The glottal
area is finally modeled as the minimum cross-sectional
area between the areas at the lower and the upper vocal fold
edge, i.e., $a = \min\{a_1, a_2\}$. A detailed description of the
aerodynamics of the model can be found in the referenced
papers [2].

Fig. 2. Low-dimensional body-cover model of the vocal folds. Panel (a): the vertical displacement of the fold is modeled through a single propagation
line; panel (b): the vertical and longitudinal displacements of the fold are modeled through two propagation lines. Both panels show, from bottom to top,
the lung pressure; the driving pressure $P_m$ acting on the vocal folds; m, k, and r, which represent respectively the mass, stiffness, and damping of the
fold; T, which represents its thickness; the fold displacements $x_1$ and $x_2$ at entrance and exit of the glottis; and the pressure at entrance of the vocal
tract.
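
A discrete-time sketch of the entry and exit areas and of the minimum-area rule is given below. It assumes that the propagation delay is rounded to a whole number of samples and that the delay line is initially at rest; the function name, the argument names and the numerical values are hypothetical, and the impact model is not included.

```python
def glottal_areas(x1, delay_samples, fold_length, x01, x02):
    """Return a list of (a1, a2, a) tuples, one per sample of x1.

    x1            -- displacement of the lower fold edge, one value per sample
    delay_samples -- propagation delay tau expressed in samples (assumed integer)
    fold_length   -- length L of the folds
    x01, x02      -- rest positions at glottal entry and exit
    """
    areas = []
    for n, x_lower in enumerate(x1):
        # Upper-edge displacement is the delayed lower-edge displacement;
        # before the delay line fills, it is assumed to be zero (at rest).
        x_upper = x1[n - delay_samples] if n >= delay_samples else 0.0
        a1 = 2.0 * fold_length * (x01 + x_lower)
        a2 = 2.0 * fold_length * (x02 + x_upper)
        areas.append((a1, a2, min(a1, a2)))
    return areas

# Made-up, unitless numbers for illustration only.
areas = glottal_areas([0.0, 0.1, 0.2, 0.1], delay_samples=1,
                      fold_length=1.0, x01=0.05, x02=0.05)
for a1, a2, a in areas:
    print(f"a1 = {a1:.2f}, a2 = {a2:.2f}, a = {a:.2f}")
```

During opening the delayed upper edge limits the area, and during closing the lower edge does, so the minimum switches between the two edges within a cycle.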

We develop here an extension of this delayed one-mass
model by allowing the displacement to propagate along two
directions: 1. the vertical axis, and 2. the horizontal axis.
The scheme is loosely inspired by the 16-mass model
introduced in [8], in which an array of two-mass-spring
systems is organized longitudinally in order to represent
horizontal differences along the length of the cord and the
longitudinal propagation on the body-cover of the fold. The
proposed model is shown in Fig. 2(b). The areas at entry
and exit of the glottis should now be computed taking into
account that the displacement may not be constant along
the longitudinal axis:

$$a_1(t) = 2 \int_0^L \big(x_{01} + x_1(l,t)\big)\, dl \qquad (3)$$

$$a_2(t) = 2 \int_0^L \big(x_{02} + x_2(l,t)\big)\, dl \qquad (4)$$

where $x_{1,2}(l,t)$ is the displacement at longitudinal position $l$,
obtained from $x_{1,2}(0,t)$ through the longitudinal propagation
delay, with $\tau_l = L/c_f$.

III. RESULTS


Let us characterize the glottal waveform by means of 
a set of voice source parameters, allowing us to better 
evaluate how the new longitudinal displacement parameter 
affects the shape of the glottal flow pulse. Figure 3 shows 
the time instants usually defined for a glottal cycle, referred 
to an LF model. 

Figure 4 shows three simulations performed with the
proposed low-dimensional body-cover model. The parameter
$\tau_l$, controlling the displacement delay on the longitudinal
axis, is gradually increased from left to right. It can be
noticed that changes in this parameter mainly affect the
closing phase of the glottal cycle. More precisely, the
return time, i.e. the time interval between the minimum
of the flow derivative at time instant $t_e$ and the closing
instant $t_c$ (see Fig. 3), scales with $\tau_l$.

Typical voice source quantification parameters extracted 
from the flow and the differentiated flow are direct ones, 
such as P (the glottal cycle period), Fo = 1/P (the 
fundamental frequency of oscillation), to (the opening 
instant), tp (the maximum flow amplitude instant), te (the 
negative peak instant), t. (the closing instant), and derived 
ones, such as the speed quotient SQ, the open quotient 
OQ, the opening quotient OingQ, the closing quotient 
CingQ, the return quotient RQ. For our discussion we 
focus on the following ones, which are among the most 
used in the literature [9]: return quotient RQ = (te—te)/P, 
open quotient OQ = (te — t0)/P, and speed quotient 
SQ = (tp — to)/(te — tp). The return quotient is directly 
related to the return phase duration, the open quotient is 
directly related to the duration of the open glottis interval 
that precedes the closure instant, and speed quotient is a 
measure of the ratio of the open phase to the closing and 
return phases. Most of these cues have been recognized 
to be particularly relevant for the study of the perceptual
influence of the voice source characteristics, and for
comparing different voice qualities (e.g., [10], [11]).

Fig. 3. Glottal flow parameters referred to the LF model: time of glottal
opening to; time and value tp, Ei of flow maximum; time and value te, Ee
of flow derivative minimum; time of glottal closure tc; glottal period P.
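The three quotients used in the discussion follow directly from the time instants of Fig. 3. The sketch below is only an illustration of those definitions: the class and function names are hypothetical, and the numerical instants are made-up values for a 10 ms cycle, not data from the simulations.

```python
from dataclasses import dataclass


@dataclass
class GlottalCycle:
    """Time instants of one glottal cycle, in seconds (names are illustrative)."""
    t_o: float     # opening instant
    t_p: float     # instant of maximum flow
    t_e: float     # instant of the flow-derivative minimum
    t_c: float     # closing instant
    period: float  # glottal cycle period P


def quotients(c: GlottalCycle) -> dict:
    """Return RQ, OQ and SQ as defined in the text."""
    return {
        "RQ": (c.t_c - c.t_e) / c.period,         # return quotient
        "OQ": (c.t_e - c.t_o) / c.period,         # open quotient
        "SQ": (c.t_p - c.t_o) / (c.t_e - c.t_p),  # speed quotient
    }


# Hypothetical cycle: P = 10 ms, return phase of 1 ms.
example = GlottalCycle(t_o=0.0, t_p=0.004, t_e=0.006, t_c=0.007, period=0.010)
q = quotients(example)
```

With these definitions an abrupt closure (tc = te) gives RQ = 0, which is the case the text associates with simplified collision models.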

Analytical models, such as the LF, are widely appre- 
ciated due to their effectiveness in controlling the voice 
source parameters. One of the advantages is the possibility 
of controlling each source parameter by acting on a well 
identified analytical parameter. The return phase is typically
easier to control in analytical models than in physical
ones, where the simplifications in the representation of
fold collision usually result in an abrupt closure,
corresponding to RQ = 0. We focus on this aspect and
compare the improved low-dimensional physical model 
with the LF class. To this aim, a set of 9 glottal flow 
waveforms was generated both by an LF model and by 
the proposed model. In both cases the parameter related 
to the return phase was increased for each next run. The 
result of the simulations and of the computation of the 
voice source parameters is shown in Fig. 5. 

The two sets of waveforms are characterized by the same
period length, and approximately the same OQ values. Due
to the differences in the shape of the pulse of the two 
models, it was more difficult to obtain similar values for 
the SQ parameter. The first thing that can be observed 
by comparing panels a) and b) of Fig. 5 is that in both 
models the parameter used to control the return phase does 
not produce appreciable pitch variations. If this property 
is an obvious one for analytical models, in which the 
period length is analytically imposed, the same behavior 
is not necessarily granted for a physical model, in which 
each component in the dynamic loop may potentially 
affect the stability and the frequency of oscillation. It has 
been observed, for instance, that changing the length T 
of the delay line representing the thickness of the fold, 
may affect only the duration of the closed phase in some
circumstances, or both the duration of the closed phase
and of the overall period (i.e., it may affect the pitch). In
all the experiments conducted on the proposed model, no
appreciable pitch variations were observed in response to
variations of the parameter τ_l representing the length L of
the fold.

Fig. 4. Result from numerical simulations performed with the proposed
low-dimensional body-cover model. Values of τ_l increase from panel (a)
to panel (c).

The comparison of panels a) and b) of the same figure 
leads to other considerations. It can be seen that, as the 
control parameters P,y and τ_l are raised, the behavior of
the three source parameters considered here, RQ, OQ, and 
SQ, is qualitatively the same: RQ increases as expected 
in both models, even if with this configuration of the low- 
dimensional glottal model it was not possible to reach 
the same values around 0.5 obtained with the LF one 
(instability of the oscillation was observed if the parameter
τ_l was further increased). Such high return quotient
values are however rarely observed in natural glottal flow 
recordings. The OQ parameter is approximately constant 
as expected (note from the definition that the return phase 
does not contribute to the open quotient OQ), except in the 
right-most part of the plot, where the curve rises slightly. 
Finally, both curves representing the SQ parameter show 
a decreasing trend, although the range spanned by the plot 
related to the LF model is appreciably larger than the range
spanned by the plot related to the low-dimensional model.
Apparently, there is no straightforward explanation for this
inequality, apart from the evident differences in the pulse
shape.

Fig. 5. Glottal flow parameters computed on a set of 9 waveforms obtained by running the LF model (panel a)) and the proposed low-dimensional body-cover model (panel b)) while increasing the parameter responsible for the duration of the return phase (parameters on the x axes are normalized).


In conclusion, some interesting properties of this class 
of physical models emerged from the comparison with an- 
alytical models, and further investigation will be conducted 
in this direction. 


IV. CONCLUSIONS 


We proposed an extension to the mechanical component
of a previously introduced low-dimensional vocal fold
model, and we discussed the effectiveness of the new
scheme in terms of control of glottal flow cues, providing
comparisons with the LF analytical model. The
additional degree of freedom introduced with this new
scheme allows control of some relevant features of the
glottal flow waveform, such as the return quotient, that
are not directly accessible with similar models previously
proposed in the literature. Future research on this class
of models is foreseen with respect to a number of issues, 
including: 1. the perceptual assessment of the synthesis to 
gain understanding on the perceptual relevance of the new 
parameters in terms of naturalness of the synthesis and of 
voice quality controllability, 2. the refinement of the low- 
dimensional model to adapt its glottal pulse shape to the 
characteristics of the LF model, thus allowing improved 
comparisons between the two classes of models, and 3. 
the design of automatic parametric adaptation algorithms 
to fit the model to real glottal waveforms.


NUMERICAL SIMULATION OF AIRFLOW THROUGH THE OSCILLATING
GLOTTIS


P. Punčochářová¹, J. Horáček², K. Kozel¹, J. Fürst¹
¹Department of Technical Mathematics, Faculty of Mechanical Engineering, Czech Technical University in Prague, Czech
Republic

Abstract: The work deals with the numerical solution of
2D unsteady compressible viscous flows in a symmetric 
channel for a low inlet airflow velocity. The 
unsteadiness of the flow is caused by a prescribed 
periodic motion of a part of the channel wall with large 
amplitudes, nearly closing the channel during the 
oscillations. The flow in the channel can represent a 
simplified model of airflow coming from the trachea, 
through the glottal region with periodically vibrating 
vocal folds to the human vocal tract. 

Keywords: Navier-Stokes equations, unsteady 
compressible viscous flow, FVM, ALE method, CFD. 


I. INTRODUCTION 


Fluid-structure interaction problems arise in many
technical and other applications. This study presents the
numerical solution of unsteady compressible viscous
flows in a symmetric channel, which is a simplified model
of the glottal spaces in the human vocal tract. In reality,
the airflow coming from the lungs causes self-oscillations
of the vocal folds, and the glottis closes completely in
normal phonation regimes, generating acoustic pressure
fluctuations. In this study, the changes of the channel
cross-section are prescribed; the channel harmonically
opens and nearly closes, as a first approximation of reality
that enables investigation of the airflow field in the glottal
region.

Here, we present results for a periodic oscillation
frequency of 100 Hz and a uniform inflow air velocity with
Mach number M∞=0.012 at the channel inlet. When the
glottis is closing, the airflow velocity becomes much
higher in the narrowest part of the airways, where the
viscous forces are also important. Therefore, for correct
modelling of real flow in the glottis, a compressible,
viscous and unsteady fluid-flow model should be
considered.

The authors present the numerical solution and the
simulations of the flow field in the human larynx airways
performed by a specially developed program.


II. GOVERNING EQUATIONS


Mathematical model: The 2D system of Navier-Stokes
equations in conservative non-dimensional form was used
as the mathematical model to describe the unsteady laminar
flow of a compressible viscous fluid in a domain [1]:


$$W_t + F_x + G_y = \frac{1}{Re}\,(R_x + S_y), \qquad (1)$$


where $W = [\rho, \rho u, \rho v, e]^T$ is the vector of conservative
variables, F and G are the vectors of inviscid fluxes, R
and S are the vectors of viscous fluxes, $Re = 2h'\rho'_\infty u'_\infty / \eta'_\infty$
is the Reynolds number given by inflow variables marked by
the infinity subscript (dimensional variables are marked by
the prime), ρ denotes the density, u and v are the
components of the velocity vector and e is the total energy per
unit volume. The static pressure in F, G is expressed by
the equation of state:


$$p = (\kappa - 1)\left[ e - \tfrac{1}{2}\rho\,(u^2 + v^2) \right], \qquad (2)$$


where κ=1.4 is the Poisson constant. The non-dimensional
dynamic viscosity in the dissipative terms of equation (1)
is a function of temperature: $\eta = (T/T_\infty)^{3/4}$.

Mathematical formulation: The computational domain
D is a scale model of a channel whose shape is inspired by
the shape of the vocal folds and supraglottal spaces, as
shown in Fig. 1. The computational domain is only the
lower half of the symmetric channel. The upper boundary
is the axis of symmetry; the lower boundary is the
channel wall, a part of which, between points A and B,
changes shape according to a given function of time
and axial coordinate:


$$w(x,t) = (a_1 + a_2)\left[ \sin\!\left( \frac{3\pi}{2} + \pi\,\frac{x - x_A}{x_C - x_A} \right) + 1 \right] + d, \quad x \in (x_A, x_C), \qquad (3)$$

$$w(x,t) = 2\,(a_1 + a_2)\cos\!\left( \frac{\pi}{2}\,\frac{x - x_C}{x_B - x_C} \right) + d, \quad x \in (x_C, x_B),$$


$a_2 = a \sin(2\pi f t)$, $t \in (0, 2\pi)$; $a_1 = 0.18$, $a = 0.015$,
where $f = 5.83 \cdot 10^{-3}$ is the dimensionless frequency. The gap
between the point C and the channel axis is $g = (d+h) - w(x_C, t)$.
The considered dimensions of the domain D are shown in Tab. 1.

Fig. 1. Computational domain D.


A simplifying assumption is that during normal
phonation the vocal fold oscillations are symmetric and that
the flow in the glottal region is also symmetric.
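The prescribed wall motion can be checked numerically against the gap values of Tab. 1. The sketch below assumes the reading of equation (3) in which the wall is a raised sine between A and C and a quarter cosine between C and B, and it uses the oscillation phase 2πft as the time argument; the function names are illustrative only.

```python
import math

# Dimensionless geometry and amplitudes taken from the text and Tab. 1.
X_A, X_C, X_B = 1.75, 2.3, 2.4
D, H = 0.4, 0.4
A1, A = 0.18, 0.015


def wall(x: float, phase: float) -> float:
    """Wall height w(x, t), with phase = 2*pi*f*t (assumed form of eq. (3))."""
    amp = A1 + A * math.sin(phase)  # a1 + a2
    if X_A <= x <= X_C:
        arg = 1.5 * math.pi + math.pi * (x - X_A) / (X_C - X_A)
        return amp * (math.sin(arg) + 1.0) + D
    if X_C < x <= X_B:
        return 2.0 * amp * math.cos(0.5 * math.pi * (x - X_C) / (X_B - X_C)) + D
    return D  # rigid wall outside the moving part


def gap(phase: float) -> float:
    """Gap g between point C and the channel axis."""
    return (D + H) - wall(X_C, phase)


g_mid = gap(0.0)                # middle position
g_min = gap(0.5 * math.pi)      # wall fully raised
g_max = gap(1.5 * math.pi)      # wall fully lowered
```

Under these assumptions the gap oscillates between 0.01 and 0.07 around 0.04, which agrees with the g_min, g_max and middle-position values quoted in the paper.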


Tab. 1. Dimensions of the computational domain D

name     x [-]    y [-]        x' [mm]   y' [mm]
A        1.75     0.4          35        8
B        2.4      0.4          48        8
C        2.3      w(x_C, t)    46        w(x_C, t)·20
g_min    -        0.01         -         0.2
g_max    -        0.07         -         1.4
L        8        -            160       -
d        -        0.4          -         8
h        -        0.4          -         8


III. NUMERICAL SOLUTION 


Numerical method: The numerical solution uses the finite
volume method (FVM) in cell-centered form on a grid
of quadrilateral cells. Because the domain is unsteady, the
integral form of the FVM is derived using the Arbitrary
Lagrangian-Eulerian (ALE) formulation. The ALE method
defines a homeomorphic mapping of the reference domain D_0
at the initial time to a domain D_t at t > 0 [2].

Numerical scheme: The explicit MacCormack (MC)
scheme in predictor (4a) - corrector (4b) form on a
moving grid of quadrilateral cells is used for
the numerical solution of the system (1). The scheme is of
2nd order accuracy in time and space [1]:


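$$W_{i,j}^{n+1/2} = \frac{|D_{i,j}^{n}|}{|D_{i,j}^{n+1}|}\, W_{i,j}^{n} - \frac{\Delta t}{|D_{i,j}^{n+1}|} \sum_{k=1}^{4} \left[ \left( \tilde F_k^{n} - s_{1k} W_k^{n} - \frac{1}{Re}\tilde R_k^{n} \right) \Delta y_k - \left( \tilde G_k^{n} - s_{2k} W_k^{n} - \frac{1}{Re}\tilde S_k^{n} \right) \Delta x_k \right], \qquad (4a)$$

$$\bar W_{i,j}^{n+1} = \frac{|D_{i,j}^{n}|}{|D_{i,j}^{n+1}|}\, \frac{1}{2}\left( W_{i,j}^{n} + W_{i,j}^{n+1/2} \right) - \frac{\Delta t}{2\,|D_{i,j}^{n+1}|} \sum_{k=1}^{4} \left[ \left( \tilde F_k^{n+1/2} - s_{1k} W_k^{n+1/2} - \frac{1}{Re}\tilde R_k^{n+1/2} \right) \Delta y_k - \left( \tilde G_k^{n+1/2} - s_{2k} W_k^{n+1/2} - \frac{1}{Re}\tilde S_k^{n+1/2} \right) \Delta x_k \right]. \qquad (4b)$$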

Δt is the time step, |D_{i,j}| is the volume of the sub-domain D_{i,j} in
position i,j (see Fig. 2), and Δx, Δy are the steps of the grid in the x and
y directions. The approximations of the convective terms
sW_k and of the numerical viscous fluxes (marked by a tilde)
R_k, S_k on edge k are central, and the vector s=(s_1,s_2)
represents the speed of edge k (see Fig. 2). The higher
partial derivatives of the velocity and the temperature in
R_k, S_k are approximated using dual volumes V_k (see
[1]) as shown in Fig. 2. The inviscid numerical fluxes are
approximated by the physical fluxes: each numerical flux
F_k, G_k on edge k is set equal to the physical flux F, G of
a neighbouring cell (such as F_{i,j} or F_{i+1,j}), taken at time
level n in the predictor and at n+1/2 in the corrector. (5)


The last term used in the MC scheme is the Jameson artificial
dissipation term AD(W_{i,j}^n) [3, 4]. The vector W is then
computed at the new time level t^{n+1}:


$$W_{i,j}^{n+1} = \bar W_{i,j}^{n+1} + AD(W_{i,j}^{n}) \qquad (6)$$
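The predictor-corrector structure of the MacCormack scheme is easiest to see on a much simpler problem. The sketch below is not the authors' 2D ALE solver: it applies the same two-stage idea to the 1D linear advection equation u_t + c u_x = 0 on a fixed periodic grid, without viscous terms or artificial dissipation, and the function name is hypothetical.

```python
def maccormack_step(u, nu):
    """One MacCormack step for u_t + c*u_x = 0 on a periodic grid.

    nu = c*dt/dx is the Courant number (the scheme is stable for |nu| <= 1).
    """
    n = len(u)
    # Predictor: forward difference of the flux c*u.
    pred = [u[i] - nu * (u[(i + 1) % n] - u[i]) for i in range(n)]
    # Corrector: backward difference on the predicted values,
    # averaged with the old solution (pred[-1] wraps around periodically).
    return [0.5 * (u[i] + pred[i] - nu * (pred[i] - pred[i - 1]))
            for i in range(n)]


pulse = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
shifted = maccormack_step(pulse, 1.0)   # nu = 1: pulse moves exactly one cell
half = maccormack_step(pulse, 0.5)      # nu = 0.5: pulse moves half a cell
```

Alternating the one-sided differences between the two stages is what gives second-order accuracy; in the moving-grid finite volume form (4a)-(4b) the same role is played by the choice of neighbouring cells for the edge fluxes.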


Grid: Fig. 3 shows the grid in part of the channel at
two time levels (at the minimum and maximum of the gap).
The minimum cell size in the y-direction is Δy_min ≈ 1/√Re
to capture boundary layer effects (see the detail in
Fig. 3, the refined cells near the wall). The
computational domain contains 450x50 cells.


Fig. 2. Finite volume D_{i,j} and the dual volume V_k.


IV. NUMERICAL RESULTS 


The numerical results were obtained for the following
input data: Mach number M∞=0.012 (u'∞=4.1 m·s^-1),
ρ∞=1.0 (ρ'∞=1.225 kg·m^-3), η∞=1/Re (η'∞=1.5·10^-5 Pa·s),
Re=5237 and atmospheric pressure p2=1/κ (p'2=102942
Pa) at the outlet.
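The quoted input data can be cross-checked with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated explicitly in the text: that pressure is scaled by ρ'∞c'∞² (which is what a dimensionless outlet pressure of 1/κ suggests), and that the reference length is 20 mm, as implied by Tab. 1 (x = 1.75 corresponds to 35 mm). All variable names are illustrative.

```python
import math

KAPPA = 1.4          # Poisson constant
RHO_INF = 1.225      # inflow density, kg/m^3
P_OUT = 102942.0     # outlet pressure, Pa
MACH_INF = 0.012     # inflow Mach number
L_REF = 0.020        # assumed reference length, m (from Tab. 1)
F_HZ = 100.0         # wall oscillation frequency, Hz

# Speed of sound implied by p' = rho' * c'^2 / kappa (assumed scaling).
c_inf = math.sqrt(KAPPA * P_OUT / RHO_INF)
u_inf = MACH_INF * c_inf            # inflow velocity, m/s
f_dimless = F_HZ * L_REF / c_inf    # dimensionless frequency


def real_time(phase: float) -> float:
    """Real time t' in seconds for an oscillation phase 2*pi*f*t."""
    return phase / (2.0 * math.pi * F_HZ)


t_peak = real_time(6.84 * math.pi)  # instant quoted as t = 6*pi + 0.84*pi
```

Under these assumptions the speed of sound comes out close to 343 m/s, the inflow velocity close to 4.1 m/s and the dimensionless frequency close to 5.83·10^-3, and the phase 6π+0.84π maps to t' = 0.0342 s, all in line with the values given in the paper.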

The computation of the unsteady solution was carried
out in two stages. First, the steady solution is computed
for a channel with a rigid wall in the middle position of the
gap, g=0.04 (0.8 mm). Then the steady solution is used as the
initial condition for the unsteady simulations.


A. The steady solution 

Fig. 4(a) shows the steady numerical solution. Results
are mapped by iso-lines of Mach number, by streamlines
and also by velocity vectors. The maximum Mach
number computed in the domain is M_max=0.173 at
x=2.317 on the axis. Fig. 4(b) shows the convergence to the
steady state solution, computed using the L2 norm of the
momentum residuals (ρu). The convergence seems to be
satisfactory for this very sensitive and complicated case.

Fig. 3. The grid of quadrilateral cells in part of the
channel at two time levels: at minimum gap g_min (top)
and at maximum gap g_max (bottom).

(a) M_max=0.173 at x=2.317, g=0.04, x_s=2.313, s=1.186

(b) Convergence to the steady state solution

Fig. 4. The steady numerical solution - M∞=0.012,
Re=5237, p2=1/κ, 450x50 cells.

(a) t=6π, g=0.04, M_max=0.154 at x=2.309, x_s=2.309, s=1.089

(f) t=8π, g=0.04, M_max=0.154 at x=2.309, x_s=2.309, s=1.089

Fig. 5. The unsteady numerical solution for wall motion -
f=100 Hz, M∞=0.012, Re=5237, p2=1/κ, 450x50 cells.
Results are mapped by iso-lines of Mach number, by
streamlines (lower part of the channel) and by velocity
vectors (upper part of the channel).


B. The unsteady solution for frequency 100 Hz

The unsteady solution in the fourth period of the wall
oscillation is shown in Fig. 5 at several time instants. The
highest Mach number maximum was reached at the
instant when the glottis is opening again, after the
minimum of the gap has been passed (see Fig. 5(c)), at time
t=6π+0.84π (t'=0.0342 s). At this instant the point of flow
separation on the wall is x_s=2.320. The position of the flow
separation point depends inversely on the width of the gap g
(see sub-captions of Figs. 4 and 5(a)-(f)), where the last
numbers denote the separation parameter:


$$s = \frac{(h+d) - w(x_s,t)}{g}, \qquad (7)$$


which is the ratio of the channel height at the separation point
x_s to the gap g at x=x_C.

Fig. 6 shows the detail of the point of flow separation
at the instant shown in Fig. 5(b). The flow separation in a
narrow divergent channel was predicted in [5, 6] to occur
at the point where the glottal width (2g) exceeds the
minimum glottal width by a fixed amount (10% or 20%),
i.e. for s in the interval <1.1; 1.2>. Our numerical
simulations show that the separation parameter can
exceed 8.5 when the gap is close to its minimum.
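The separation parameter of equation (7) can be evaluated directly from the wall shape. The sketch below assumes the reading of equation (3) in which the wall behind point C is a quarter cosine, uses the oscillation phase 2πft as the time argument, and takes the separation points x_s from the sub-captions of Figs. 4(a) and 5(a); function names are illustrative.

```python
import math

# Dimensionless geometry and amplitudes from the text and Tab. 1.
X_A, X_C, X_B = 1.75, 2.3, 2.4
D, H = 0.4, 0.4
A1, A = 0.18, 0.015


def wall(x: float, phase: float) -> float:
    """Wall height w(x, t), with phase = 2*pi*f*t (assumed form of eq. (3))."""
    amp = A1 + A * math.sin(phase)
    if X_A <= x <= X_C:
        arg = 1.5 * math.pi + math.pi * (x - X_A) / (X_C - X_A)
        return amp * (math.sin(arg) + 1.0) + D
    if X_C < x <= X_B:
        return 2.0 * amp * math.cos(0.5 * math.pi * (x - X_C) / (X_B - X_C)) + D
    return D


def separation_parameter(x_s: float, phase: float) -> float:
    """s = channel height at x_s divided by the gap g at x_C, eq. (7)."""
    gap = (H + D) - wall(X_C, phase)
    return ((H + D) - wall(x_s, phase)) / gap


s_steady = separation_parameter(2.313, 0.0)            # Fig. 4(a)
s_unsteady = separation_parameter(2.309, 6 * math.pi)  # Fig. 5(a)
```

With this geometry the two cases give values of about 1.19 and 1.09, close to the s=1.186 and s=1.089 reported in the sub-captions, which lends some support to the assumed wall shape.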
Fig. 6. Detail of the flow at the separation point -
t=6π+π/2, g_min=0.01, x_s=2.333, s=8.579.

Fig. 7. Dimensionless gap g, Mach number and pressure
at x=2.3 on the channel axis in real time t' - f=100 Hz,
M∞=0.012, Re=5237, p2=1/κ, 450x50 cells.

Fig. 8. Mach number along the channel axis at several
time instants during the fourth oscillation period.

Fig. 7 shows the changes of the gap g, the Mach number
and the pressure in real time at x=2.3 on the
channel axis. The phase shifts between the minimum
glottal gap g and the maxima of the Mach number and
pressure fluctuations are about 1.7·10^-3 s and 7.8·10^-3 s,
respectively. It can also be seen that the flow becomes
periodic after the first period of the oscillations.

Fig. 8 shows the Mach number along the axis of
symmetry of the channel at several time instants during
the oscillation period. Behind the narrowest channel
cross-section (x=x_C) a second peak of the Mach number
forms and travels to the outlet as a decaying wave.


V. SUMMARY 


The numerical method and a special program code
solving the 2D unsteady Navier-Stokes equations for a
viscous compressible fluid have been developed. The
method has been used for the numerical solution of the
airflow in a simplified model of the human vocal tract
geometry. Even though no complete closure of the glottis is
modeled, the numerically simulated airflow field in
the glottis is complex and relatively close to reality.

Future tests of the method in modeling the flow in
the human vocal tract will focus on narrowing the
minimum glottal width (2g_min<0.02) and lowering the inlet
flow velocity, and the geometry of the channel will be made
closer to the real geometry of the glottis and the vocal tract.


ADVANCED VOICE ASSESSMENT.
A prospective case-control study of jitter%, shimmer% and Qx%, glottis closure cohesion factor (Spead by

1. Introduction


It was suggested at the European Oto-Rhino-Laryngology
conference 2007 in Vienna that voice analysis is empirical and
that clinical voice treatment is not evidence based. The
Cochrane Handbook [1] gives advice for evaluating the quality of
research, grouping evidence into 3 quality levels.
Level A (randomized controlled trial/meta-analysis):
High-quality randomized controlled trial (RCT) that considers
all important outcomes. High-quality meta-analysis
(quantitative systematic review) using comprehensive search
strategies. Level B (other evidence): A well-designed,
non-randomized clinical trial. A non-quantitative systematic review
with appropriate search strategies and well-substantiated
conclusions; includes lower quality RCTs, clinical cohort
studies and case-controlled studies with non-biased selection
of study participants and consistent findings. Other evidence,
such as high-quality, historical, uncontrolled studies, or
well-designed epidemiological studies with compelling findings, is
also included. Level C (consensus / expert opinion):
Consensus viewpoint or expert opinion.

The purpose of this categorization is that good studies can be
combined in meta-analyses to confirm the results, as is done in
e.g. cancer and cardiology research.

In our two Cochrane reviews on vocal nodules [2] and
laryngo-pharyngeal reflux [3], no clinical evidence-based studies were
found, either for the treatment of vocal nodules or for
laryngo-pharyngeal reflux. In the review of vocal nodules 659 papers
were evaluated, and in the review of laryngo-pharyngeal reflux
302 papers. The problems most commonly found were the lack of a
clear baseline for inclusion in the studies and the lack of
unanimous objective visual and acoustic criteria.

Therefore, in part one of this prospective case-control
study [4], we have tried, first, to define a baseline of a
complaint of a non-functioning larynx; second, to standardize
simple objective visual demands for the larynx mucosa including the
vocal cords, but based on oedema of the arytenoids; third, to
evaluate the measures of jitter percent and shimmer percent in
relation to the closed phase percent of the vocal cords.
Evidence of pathological parameters was defined for
sustained tones as well as for the reading of a standard text (Table
1); a difference was also found from before to after treatment (Table
2), with treatment as described earlier [5].

As part two we used the same patient material, for all with
sufficient data, in the same prospective case-controlled setup,
for two still more advanced objective throat
function analyses: the Cohesion Factor of irregularity as defined
in the Spead program by Laryngograph Ltd., elucidating
kymographic aspects, and the Long Time Average Spectrum
(LTAS).

Method 


I. Inclusion criteria were a. subjective complaints of a
non-functioning larynx combined with b. a professional assessment
and visual score grouping the patients by swelling in the
arytenoids +/- pathological vocal cords. Patients without
swelling of the arytenoids and with normal vocal cords were
rated normal, score 1 by visual inspection. Patients with
swelling rated from 2 to 5 were abnormal. There are individual
variations, but a normal video-stroboscopy includes a normal
surface of the arytenoids without oedema and a normal shape,
as well as normal colour and movement at stroboscopy of the
vocal cords and all the rest of the mucosa of the larynx (Fig.
1A, normal, score 1, and Fig. 1B and C, abnormal scores;
scores 3 and 5 presented).

II. The parameter: the closed phase of the vocal cords defines
the exact point where the vocal cords meet in the synchronized
glottography with stroboscopy [6]. This is difficult to see
if there is oedema of the arytenoids or of the whole larynx
mucosa. The closure of the vocal cords (Qx%) and the
fundamental frequency (Fx%) can under those circumstances
be compromised even if the vocal cords themselves have
movement. The whole larynx can be affected due to infections,
allergy, reflux, misuse, etc. [5]. Testing binary equal
movements of the vocal cords related to the total amount of
movements gives a Cohesion Factor of irregularity (Spead by
Laryngograph Ltd.) for Qx% and Fx%, analyzed for a
sustained tone for 4 seconds and reading of a standard text
("The North Wind and the Sun"), Fig. 2. The abnormality degrees
of the arytenoids with visual scores of 4 are shown before and
after treatment.

III. The clinical use of harmonics, including formants, has been
empirical in pathology until now. The patient material analysed for
the cohesion factor was also analysed for Long Time Average
spectrograms (LTAS), for a sustained tone /a/ for 4 seconds
and a standard text ("The North Wind and the Sun"). The problem
was to point out the maximal intensities in pathology, especially
related to formants, and the change related to treatment.

Fig. 3a shows the normal LTAS during reading of The North Wind
and the Sun for 35 persons with normal larynx, score 1,
including normal arytenoids, the measurement taken from
Spead by Laryngograph Ltd. and placed in an Excel sheet.
The curves were extracted from individual sheets; harmonics
were measured individually on the Multi-Dimensional Voice
Program system by Kay Elemetrics and compared up to
12,000 Hz.

The statistics were based on SAS JMP (survival analysis) of
the huge amounts of data. Fig. 3b shows the curves of 301 patients
with a visual score of deviant arytenoid form of 2-5.
Results 


Table 3 shows the cohesion factor of Qx%, statistical analyses: 
Cohesion factor % for 35 normals and 301 abnormals as 
defined by oedema of the arytenoids and related pathological 
mucosa. 

Among others a significant difference was found for Qx% and 
standard deviations between normal and abnormal measures, 
Welch ANOVA p<0,0001 for sustained tone. 

Analysis of Long Time Average Spectrograms (LTAS) showed
no overall difference between the pathological
video-stroboscopies and the normals (Overlay Plot), except in the range
between 2500 and 4000 Hz (Table 4).


Discussion 


It has been shown that jitter% and the closed phase % Qx of 
the vocal cords are better and evidence based, in a 
prospective case-control study and in a prospective cohort 
study, related to medical treatment of pathological changes of 
the larynx including the arytenoid regions, - not only of the 
vocal cords. 

A differentiation can be made of whether the primary tone 
generator (including the arytenoids, the mucosa and the vocal 
cords) or the more coordination related factors of sound 
making should be focused upon in medical treatment. The 
cohesion % is significantly better in tone and text after 
treatment. In the LTAS the area of 2500 to 4000 Hz has a 
significantly higher value in dB after treatment when reading a 
standard text. 

It was earlier shown that phonetograms are better after
medical treatment [5]. So now we have evidence-based
measurements for the future treatment of voice disorders.


Fig. 1A: Score 1. Fig. 1B: Score 3.

Conclusion


The new parameter, the Irregularity % or cohesion factor
between all measured signals and pairs of successive
vocal cycles that fall into the same analysis bin in the
histogram, has been presented as evidence based in a
clinical setting in a prospective case-control study and
a cohort study before and after treatment. Normal values
and values after treatment are given. On the same material,
the LTAS in the range of 2500-4000 Hz has been shown to
be of evidence-based value in a clinical setting in the
case-control study as well as the cohort study before and
after treatment, with higher intensity values in normals
and after treatment.

A:


arytenoids 
shape 1 
shape 2-5 
statistics 
B: 


arytenoids 
shape 1 


shape 2-5 


statistics 


mean mean 
jitter% Std Dev shimmer% 
1 9,2 


mean 
Qx% Std Dev N 
47,1 6,5 35 


Std Dev 
6,5 


Comments 


4 8,2 6,6 45,3 12,7 338 
significant difference for Qx% and standard 
deviations between normal and abnormal 


measures, Welch ANOVA p<0,0001 


loudness 
variation% 
15,4 


frequency 
variation% Std Dev 
9 6,9 


Std Dev Qx% Std Dev N 
9,1 48,7 6,5 


normals SD 
35 
<6,9 abnormal> 11,1 
11,1 5,6 338 
normals SD for 
Qx% <6,5 
abnormals >11.4 
*p as given (Wilcoxon test) 


46,0 11,4 


p 0,03 * p 0,011 * 


A: 77 patients with examinations before and after treatment, 
intonation of a sustained tone /ah/. 


arytenoids 
abnormality 


shape 4 


mean jitter% 
mean shimmer% 7,4 
mean Qx% 


shape 3 


mean jitter% 


(shape 5 1 pt.) (shape 5 3 ptt.) 
2. examination Std Dev N 1° 32/ 2nd.25 
1,1 
3,7 
6,1 


1. examination Std Dev 
5.7 


43,7 


1.examination 
3,8 


mean shimmer% 7,4 


mean Qx% 


shape 2 


mean jitter% 


42,3 


1.examination 
4,9 


mean shimmer% 4,9 


mean Qx% 


Arytenoids score 4, Fx % cohesion factor before treatment 


r 
. 


Arytenoids score 4, Qx % cohesion factor before treatment 


50,3 


45,4 : 
(shape 1 1 pt). 


(shape 1 2 ptt.) 


after treatment 


Table 1

Groups of consecutive digitized videostroboscopies evaluated by 2-3
observers on the spot, and voice analysis at the same time, of normal
controls (arytenoids shape grade 1, without laryngeal complaints)
versus abnormal clients with laryngeal complaints (arytenoids shape
grade 2-5), measured with SPEAD by the firm Laryngograph Ltd.

for frequency variation

A: sustained tone /ah/.

B: reading of a standard text: The North Wind and the Sun.

Table 2. 

statistics 

For Tone, no significant change in jitter% or shimmer% was
found with the paired t-test.

For Qx% there was a significantly better closure of the glottis
of 4,6% (43,8% to 48,4%), with a significance of 0,0008 with the
paired t-test.

For the reading of a standard text, the regularity frequency% was
reduced by 1,98% (p=0,053) and the regularity of loudness% by 1,7%
(p=0,004), and the Qx% was better with a change of 2,56% (p=0,044),
analysed with paired t-tests.

Fig. 3a shows the normals, visual score 1, related to LTAS, and Fig. 3b the abnormal arytenoids, visual score 2-5, related to LTAS.

Table 3. Cohesion factor. Values are given with their range in parentheses.

Normals (arytenoids score 1) versus abnormals (arytenoids score 2-5):

                  Sustained tone Qx%   Reading of a text Qx%   p (difference)
Arytenoids 1      19 (12-26)           35 (30-40)              0,042 *
Arytenoids 2-5    18 (15-20)           41 (39-42)

                  Sustained tone Fx%   Reading of a text Fx%   p (difference)
Arytenoids 1      1,9 (1-6)            13 (8-19)               0,03 *
Arytenoids 2-5    5,3 (3,7-5,8)        19 (18-21)

Cohesion factor before and after treatment, arytenoids score 2-4:

                  Sustained tone Qx%   Reading of a text Qx%   p (difference)
before            17 (12-22)           44 (40-48)              0,015 *
after             14 (9-19)            37 (33-41)

                  Sustained tone Fx%   Reading of a text Fx%
before            4.5 (1.8-7.2)        22 (19-26)
after             3 (0.3-5.7)          17 (14-22)

Table 4. LTAS in normals with arytenoids score 1 vs abnormals with arytenoids score 2-5.
Table 5. LTAS, group 2-4 before and after treatment: Overlay Plot and Product-Limit Survival Fit (Survival Plot), which showed a significant difference.
PRE-POST SURGERY EVALUATION BASED ON THE PROFILE OF
GLOTTAL SOURCE

Abstract: There is an ever increasing interest in
voice studies in the research and application
fields, from both biomedical and bioengineering
perspectives. More and more resources are assigned to this
research field in the search for new methods easing the
study, evaluation and diagnosis of voice pathology. In
this sense it is well known that the mucosal wave is a
fundamental phenomenon present in voice
production, highly related to voice quality. Thus,
when a specific pathology is present in the vocal folds,
producing modifications in their dynamic model, the
amount of mucosal wave is appreciably altered. The
present work uses results from inverse filtering to
derive a mucosal wave correlate from the glottal flow
derivative. Therefore two important estimates, of the
phonation pattern (glottal excitation) and of the vocal
fold behaviour (mucosal wave correlate), may be used
in pathology detection. A clinical case study
corresponding to the presence of a polyp on a single
vocal fold (unilateral) is conducted to evaluate the
pathological alteration produced on the dynamics of
the vocal folds and on the presence of mucosal wave.
The results illustrating the behaviour of the glottal
closure and vocal fold dynamics obtained before and
after treatment are given and discussed.


Keywords: Mucosal Wave, Glottal Source, Polyp,
Voice Pathology, Vocal fold dynamics


I. INTRODUCTION 


Voiced sounds produced by humans may be defined as
complex pseudo-periodic signals resulting from the
transmission through a gaseous medium of a pressure wave
produced by the vibration of the vocal folds (Glottal
Excitation or Source), which undergoes a spectral
transformation as it passes through the supraglottal organs
(filtering) up to its emission through the lips (radiation).
Using inverse filtering methods the glottal excitation
(source) may be obtained from the residual left after the
elimination of the vocal tract influence [1][2] (see Figure
1). Vocal fold dynamics is directly related to the
distribution of the different components of the histologic
structure of the vocal folds [3]. The organization of these
components is known as the body-cover structure (see
Figure 2).

Figure 1. Iterative inverse filtering methodology used in the
estimation of the glottal excitation (source)


The dynamic behaviour of the vocal folds may be 
reproduced to a certain extent using biomechanical 
equivalent models. The 3-mass model by Story and Titze
given in Figure 2 is a complete description of the vocal fold
dynamics considering separately the components of body 
and cover structures [4]. 


Figure 2. 3-mass model (1 body mass + 2 cover masses).


This model can represent, to a certain extent, the mucosal
wave phenomena that take place on the vocal folds during
phonation and that are observable by stroboscopic
inspection [5]. More precise descriptions could be
obtained using more distributed masses to represent the
behaviour of the fold cover.


II. CASE REPORT 


The voice of a 34-year-old female theatre actress, a
non-smoker, who sought professional help because of a
four-year limitation in vocal production compensated by
vocal over-effort, was used in the study. The pre-surgery
(pathological) and post-surgery conditions were
assessed from electroglottographic and video-endoscopic
examinations (with a rigid 70° stroboscope) as
well as from subjective GRBAS evaluation [6]. The
images and recordings produced for the present study
were obtained using the software MEDIVOZ.

The final results of the pre-surgery study of the vocal
folds determined the presence of a gelatine-type polyp [7]
(indicated by the arrow in Figure 3, left) affecting the free
edge of the medial third of the left vocal fold,
substrate-attached and mildly edematous. Contralateral lesions
were not observed. The glottal closure produced during
phonation was incomplete, and the mucosal wave on the
affected vocal fold was asymmetric and reduced.


Figure 3. The left and right templates show, respectively, the
pre- and post-surgery condition of the vocal folds for a
gelatine-type polyp.


The patient underwent surgical treatment to excise the
lesion and to re-establish the healthy anatomical
condition of the vocal fold, respecting the vocal ligament,
the anterior commissure and Reinke’s space, to ease the
vibration conditions of the vocal folds so that the mucosal
wave could be re-established.


III. METHODS


3.1 Sample collection 

The voice recording protocol included three utterances of
the vowel /a/, each lasting at least 3 sec. Segments of
0.2 sec. were extracted from the central part of each
recording for the analysis. These segments
were processed to derive the average acoustic wave and
the cover dynamic component [8].
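The segment selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `central_segment` and the 16 kHz sampling rate are assumptions, since the text does not state the recording rate.

```python
def central_segment(samples, sample_rate, seg_seconds=0.2):
    """Return the seg_seconds-long slice centred in the recording."""
    seg_len = int(round(seg_seconds * sample_rate))
    if seg_len > len(samples):
        raise ValueError("recording is shorter than the requested segment")
    start = (len(samples) - seg_len) // 2
    return samples[start:start + seg_len]


# Hypothetical 3-second recording at an assumed 16 kHz sampling rate.
rate = 16000
recording = list(range(3 * rate))
segment = central_segment(recording, rate)
```

Taking the slice from the middle of the emission avoids the onset and offset transients of the sustained vowel.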

3.2 Estimating the glottal flow derivative and the 
mucosal wave correlate 

The dynamic behaviour of the vocal fold cover may be
described using a simplified 2-mass model if the average
acoustic wave is eliminated from the glottal excitation, as
the residual dynamics can then be referred to the vocal fold
body mass [9][10]. This new dynamic system will
consider only the cover masses, related to the mucosal
wave phenomenon. The mucosal wave
correlate will be modelled by the dynamic response of
each of the four masses (two per vocal fold cover) to
external forces (induced by pressure differences) as a
general equation of the kind:

$$
f^{i,j}_{r,l} - R^{i,j}_{r,l}\,v^{i,j}_{r,l} - M^{i,j}_{r,l}\,\frac{dv^{i,j}_{r,l}}{dt} - K^{i,j}_{r,l}\int_{-\infty}^{t} v^{i,j}_{r,l}\,dt - K^{ij}_{r,l}\int_{-\infty}^{t}\left(v^{i,j}_{r,l} - v^{j,i}_{r,l}\right)dt = 0
$$

where $f^{i,j}_{r,l}$ is the force acting on the sub-glottal ($i$) or
supra-glottal ($j$) mass of the right ($r$) or left ($l$) vocal fold
in the direction of the vocal fold movement (considered
normal to the transversal section of the larynx), $v^{i,j}_{r,l}$ is the
velocity of the mass considered, $M^{i,j}_{r,l}$, $R^{i,j}_{r,l}$ is the
loss parameter (viscosity and heat dissipation), $K^{i,j}_{r,l}$
are the elastic constants of the springs linking the
cover masses to the body mass, and $K^{ij}_{r,l}$ is the
elastic constant of the spring linking both cover masses. This 2-mass
model explains the behaviour of the mucosal wave
observed when exploring the vocal folds in movement
during phonation to an extent sufficient for the
purposes of the present study; more complete descriptions
would require more complex models. The separation of the
average acoustic wave from the cover dynamic
component allows the separate study of both
components, associated respectively with the movement of
the body and cover masses [11].


3.3 Analyzing the glottal flow derivative and the 
mucosal wave correlate 

The analysis of the glottal source in the time domain
allows normal or abnormal phonation conditions to be
evaluated from the resulting profile. To this end, the
following singular points in the open-close phases of a
glottal excitation with period $T$ have to be determined:
return interval ($T_r = t_r$), closed interval
($T_c = t_o - t_r$), open interval ($T_o = t_{cl} - t_o$) and
closing interval ($T_{cl} = T - t_{cl}$) (see Figure 4).

Figure 4. Singular points in the opening-closing phases of a
phonation cycle according to the L-F model.
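As a small worked illustration of these interval definitions, the sketch below assumes that the return, closed, open and closing intervals partition one period, bounded by the end of the return phase, the opening instant and the start of closing. The names `t_r`, `t_o`, `t_cl` and `phase_intervals` and the numeric instants are hypothetical; only the 4.5 msec period is taken from the case studied.

```python
def phase_intervals(t_r, t_o, t_cl, period):
    """Split one glottal period into return, closed, open and closing intervals."""
    if not 0 <= t_r <= t_o <= t_cl <= period:
        raise ValueError("instants must satisfy 0 <= t_r <= t_o <= t_cl <= period")
    return {
        "return": t_r,             # from full closure to the end of the return phase
        "closed": t_o - t_r,       # end of return phase to the opening instant
        "open": t_cl - t_o,        # opening instant to the start of closing
        "closing": period - t_cl,  # start of closing to the end of the period
    }


# Hypothetical instants (msec) inside a 4.5 msec period.
intervals = phase_intervals(t_r=0.5, t_o=0.8, t_cl=3.9, period=4.5)
```

Because the four intervals are differences of consecutive instants, they always add up to the period, which gives a simple sanity check on any set of estimated instants.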


To evaluate these points the direct glottal flow derivative
is not used, due to the uncertainty about the point where
the closure ends and the opening starts, which results from
higher order vibration cycles superimposed on that signal.
Instead, it may be shown that the mucosal wave correlate
provides more reliable indications about where that instant
is to be estimated. The derivative of the mucosal wave
correlate also gives precise hints to localize the end of the
return phase and the beginning of the closing phase. The
estimation of the different points may be carried out as
follows:

> $T_r$: Return interval. After full closure the static
pressure conditions (atmospheric) are
re-established by reverse flow. This process is
known as the return phase. It may be estimated
from the first minimum on the derivative of the
mucosal wave correlate.

> $T_c$: Complete closure interval. The glottal
pathway has been interrupted by full contact of
both vocal folds at the supra-glottal end. The
inertial behaviour of the gas flow results in a
sharp pressure decay down to a minimum.

> $T_o$: Open interval. At the end of the closure
phase the open phase starts at the subglottal
end. Its start may be shown to be related to
the minimum observed on the mucosal wave
correlate.

> $T_{cl}$: Closing interval. Once the equivalent
section between the vocal folds has reached a
maximum, a decrease is initiated. Its start may
be estimated at the intersection of the line
extending from the end of the return phase with
the glottal excitation.


The amount of stress in the vocal folds and the mucosal 
wave energy may be estimated from the profile of the 
mucosal wave power spectral density [12]. 


IV. RESULTS AND DISCUSSION


The results obtained before and after surgery for the
same patient are given in Figure 5. In the pre-surgery
exploration the voice was characterized by
roughness (tense and hoarse voice) and air leakage.
These characteristics can be appreciated in the profile of
the glottal excitation (upper template). The profile
presents a short period of T=4.5 msec, reflecting the
tension existing between the vocal folds during phonation.
Tense voice is usually associated with a decrease in
the mucosal wave amplitude/energy relative to that of the
average acoustic wave, as it may be assumed that the
higher the tension, the less mucosal wave will be present.
In the case studied the decrease in the mucosal wave
is not symmetrical: it mostly affects the closure
interval of the phonation cycle, whereas the mucosal wave
is more intense during the return and open intervals (the
closure interval appears to be extremely short, as the polyp
produces incomplete closure). The profile of the glottal excitation
is almost a reflected version of the Liljencrants-Fant (L-F)
pattern [12] with respect to the vertical axis. This
asymmetry may be due to the influence of the vocal fold
affected by the polyp, where the mucosal wave appears
much more diminished. The presence of the polyp in the
medial third of the fold has produced a change in the
histological structure of the fold, resulting in an
increase in the lax connective tissue on the fold cover
[13], which becomes a more rigid dynamic structure
behaving as a single mass, i.e., the affected fold has
switched to a 1-mass structure. The healthy vocal fold
keeps an intact histological structure, as it appears that a
contact lesion is not present, and the reduction in its
mucosal wave is due to the tension associated with the
pathological vocal fold. The resulting dynamics of both
vocal folds diverge from the ideal 3+3-mass cover
model to become a 3+1 one.

Figure 5. Estimating time intervals on the glottal source and
the mucosal wave correlate. Top: pre-surgery. Bottom: post-surgery.


A second characteristic of the excitation is the presence
of air leakage. The polyp has created a mass
protruding into the glottal cleft where fold contact takes
place during the closed phase, preventing effective
glottal closure and producing a flow leakage. The
phonation times given in Fig. 5 (top) show that the air
escape is marked by a rather short $T_c$ and an increase in
the duration of the open interval $T_o$. The flow injection
during the open phase is produced at the beginning of the
interval, the pressure decaying to the static conditions as
the air escape does not allow an efficient burst to be
injected. The result is a profile with a premature and
incomplete open phase, which gives the profile its
mirror-like appearance when compared
against the well-known L-F one.

The objective of surgery in this case was to restore the
vocal fold anatomy to normality by an excision to
eliminate the edema and stroma and the epithelial
hyperkeratosis. The resulting post-surgical glottal excitation is
shown in Fig. 5 (bottom). The resulting profile is much
more similar to the classical L-F one. The amount of
mucosal wave is now more balanced along the phonation
cycle, the tension observed during the closing phase
having disappeared. The 2-mass cover model behaviour
seems to have been recovered.

The normalization of the glottal excitation profile is
reflected in the duration of the different intervals: the return
phase is more relaxed, the closure has increased and
the burst due to flow injection recovers the classical
hunchback-like pattern. Air escape has almost
disappeared, although a slight leakage is still appreciable
as a result of the phonation gesture acquired during the
persistence of the pathology. Its complete elimination is
the role reserved for voice rehabilitation procedures, to
ensure the success of surgery and avoid the recurrence
of the pathology. The control of flow and the
improvement in glottal closure efficiency result in a
larger and better shaped flow injection. The dynamics
of the vocal folds have been re-established with better
adduction and a recovered mucosal wave pattern.


V. CONCLUSIONS 


The mucosal wave correlates may serve to track the voice 
quality during the rehabilitation phase to optimize the 
altered functionality of the vocal folds. The mucosal 
wave appears as a critical element for the study of voice 
pathology, as it may help in determining the duration of 
each phase and its relative relevance or condition. 
Pathologies inducing an increment in tension produce a 
reduction of the mucosal wave in certain parts of the 
phonation cycle. The study of the mucosal wave profile 
may serve as an indicator of the phonation conditions, not 
only to determine the presence of pathology, but to 
establish the evolution of treatment in an objective way.


DATA WAREHOUSE FOR PROSODY FEATURES


Jana Krutisova, Jana Klečková 
Department of Computer Science and Engineering, University of West Bohemia, Pilsen, Czech Republic 


Abstract: Speech is the most direct and intuitive form
of human communication. To emphasize an utterance, a
speaker uses a set of non-verbal features called
prosody. Prosody provides critical information
for recognition and understanding systems. This
paper describes the idea of using data warehouse
properties for the storage of prosody features.

The work presented in this paper was supported by 
the project number 2C06009. 

Keywords: Spontaneous speech, prosody features, data


I. INTRODUCTION


The main goal of this paper is to introduce a feasible
solution to problems concerning prosody features in
spontaneous speech.

Speech is the most direct and intuitive form of human
communication. People do not speak monotonously, word
by word; they use elements that affect and accentuate the
interpretation of the spoken utterance. People gesture, and
through variations in melody, intonation, pauses and
stress they convey their emotions and mood,
for example joy, anger, sadness, surprise or fear. The set
of these non-verbal features is called prosody.
Prosody provides critical information for recognition
and understanding systems. These non-verbal features
support and emphasize the meaning of an utterance.

The quality of speech recognition is increased by
determining the speaker’s style from prosody features,
because prosody
= depends on the speaker’s age and sex.
= represents the speaker’s attitude and emotion.
= usually reveals a foreign speaker.
= is affected by aphasia.


II. PROSODY AND ITS FEATURES 


Prosody belongs exclusively to spoken language. It
is a set of features forming the so-called suprasegmental
component of an utterance. Missing prosody is the main reason
why synthetic speech sounds unnatural. Prosody improves
the intelligibility of utterances; on the other hand, a
misunderstanding of prosody can change their meaning.

Fundamental prosodic features include fundamental
frequency F0, voice energy and speaking rate. Accent,
intonation, emotive timbre, pauses, fillers and repetitions
form the group of derived prosodic features.


These attributes play an important role in the correct
recognition and understanding of spontaneous speech, and
they can be detected on prosodic segments such as
= Fundamental unit of spoken speech. 
= Integrated intonation unit. 
= Syllable group with one word accent. 
= Elementary segment, where prosody can be used. 

Why are not only speech recognition, but also nonverbal
communication and speaker identification, very
important?

In the Czech language, with its free word order,
intonation provides critical information for recognition
and understanding systems. For some sentences the
intonation is essential to determine the core of a
communication, depending on a speaker who emphasizes
the meaning of the sentence. The design of the module for
suprasegmental processing is based on the
partitioning of the speech into sentences.

There are words that sound the same but have
different meanings. Correct understanding of
these words depends on the context of the utterance.

On the other hand, in Czech neither the vowel
quality nor the vowel quantity of a stressed syllable
differs from that of an unstressed syllable.

Different speakers, and even the same speaker,
pronounce the same word differently. It depends on the situation
and the background in which it happens. A word, or a
whole utterance, can be pronounced against a more or less
disturbing and noisy background.

For the style of an utterance it matters whether the speech
is read from a text or is spontaneous.
A hearer distinguishes these two categories
even when their lexical, syntactic and semantic structures
are identical. The contribution of prosody is very significant for
distinguishing between the perception of read text and
that of spontaneous speech. Spontaneous speech,
especially in a dialogue, responds to the purport of an utterance. It
is affected not only by its subject, but also by other
cues, the speaker’s feelings or a noisy environment.

The perception and understanding of the meaning of an
utterance give rise to pauses of various lengths. These pauses
might never appear, or might be shorter,
in read text on the same subject. An utterance is
actually a self-contained unit in terms of both meaning and
purport and in terms of its sound. An utterance of a
speaker (for example a foreigner or a small child) can be
only a sequence of words, understandable in the given
situation, without being a sentence in terms of grammar.

On the other hand, a sentence can consist of only one
short word whose purport and meaning are unmistakable
from context. For example, the short sentence “And?” can
carry the same meaning as the sentence “What
happens?” or “What will happen?” It can even be
replaced by a gesture alone in a specific situation.
Prosody distinguishes every speaker, because people
usually do not use literary language in everyday
communication. Prosody therefore often reveals, for
example, the geographic or social origin of the speaker, and it shows
the speaker’s attitude.

The importance of prosody is also apparent from previous
work on automatic dialogue act recognition in
Czech based on sentence structure.


III. USE OF A DATA WAREHOUSE 


Many factors affect verbal
communication: gesture, the context of the previous
utterance and the next utterance, and information about the
speaker’s individuality, habits and attitude.


Fig.1 Star schema for a storage of prosody characteristics


For the above reasons it is very important to take a
multidimensional view. Although data warehouse
technology is not a standard approach in the spontaneous
speech area, we try to use a multidimensional database
architecture and its properties for the storage of prosody
characteristics. A hypercube, the underlying structure
of data warehouse technology, can be implemented by a
star schema. In this case, all data are contained in two
types of tables, called fact tables and dimension tables.
There is a single fact table at the centre of a star schema.


IV. DISCUSSION 


A fact table contains the measurements, metrics or
facts of the processes under consideration. In addition to the
measurements, a fact table contains foreign keys to the
dimension tables. Several dimension tables are used to
store descriptive text information about the values in the fact table. The
dimension attributes also contain one or more hierarchical
relationships. For the first experiment, our schema consists of
three dimension tables and a fact table (Fig.1).
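A star schema of this shape can be sketched with Python's built-in `sqlite3` module. The table and column names below (`dim_speaker`, `dim_utterance`, `dim_segment`, `fact_prosody`) are hypothetical; the text only states that the first experiment uses three dimension tables and one fact table, and the measures are taken from the fundamental prosodic features named earlier (F0, energy, speaking rate).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Three dimension tables hold descriptive attributes (names are assumed).
CREATE TABLE dim_speaker (
    speaker_id INTEGER PRIMARY KEY, age INTEGER, sex TEXT, native_speaker INTEGER);
CREATE TABLE dim_utterance (
    utterance_id INTEGER PRIMARY KEY, style TEXT, environment TEXT);
CREATE TABLE dim_segment (
    segment_id INTEGER PRIMARY KEY, segment_type TEXT);
-- The single fact table holds the measurements plus one foreign key per dimension.
CREATE TABLE fact_prosody (
    speaker_id INTEGER REFERENCES dim_speaker(speaker_id),
    utterance_id INTEGER REFERENCES dim_utterance(utterance_id),
    segment_id INTEGER REFERENCES dim_segment(segment_id),
    f0_hz REAL, energy REAL, speaking_rate REAL);
""")

# Made-up illustrative rows.
conn.execute("INSERT INTO dim_speaker VALUES (1, 30, 'F', 1)")
conn.execute("INSERT INTO dim_utterance VALUES (1, 'spontaneous', 'quiet')")
conn.execute("INSERT INTO dim_segment VALUES (1, 'intonation unit')")
conn.executemany(
    "INSERT INTO fact_prosody VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, 1, 200.0, 0.6, 4.5), (1, 1, 1, 220.0, 0.7, 5.0)],
)

# A multidimensional query: mean F0 sliced by speaker sex and utterance style.
rows = conn.execute("""
SELECT s.sex, u.style, AVG(f.f0_hz)
FROM fact_prosody AS f
JOIN dim_speaker AS s ON s.speaker_id = f.speaker_id
JOIN dim_utterance AS u ON u.utterance_id = f.utterance_id
GROUP BY s.sex, u.style
""").fetchall()
```

Grouping by attributes of different dimension tables is what gives the multidimensional view; a dedicated data warehouse tool would add hierarchies and pre-aggregation on top of the same layout.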


V. CONCLUSION 


Our future goal is to use data warehouse
properties for the storage of prosody features:

= Design a multidimensional model for storing prosodic
features.

= Select a suitable tool for implementing the designed
model.

= Verify whether, and how, data warehouse properties
can be used for monitoring spontaneous utterances.


Abstract: There are a number of clinical conditions
that affect directly or indirectly the function of the 
vocal folds and thereby the pressure waveforms of 
elicited sounds. If the relationships between the 
clinical conditions and the voice quality are 
sufficiently reliable, it should be possible to detect 
these diseases or disorders. The focus of this paper is 
to determine the set of features and their values that 
would characterize the speaker’s state of vocal folds. 
To the extent that these features can capture the 
anatomical, physiological, and neurological aspects of 
the speaker they can be potentially used to mediate an 
unobtrusive approach to diagnosis. We will show a 
new approach to this problem, supported with results 
obtained from two disordered voice corpora. 
Keywords : Model, glottal pulse, pathological voice 


I. INTRODUCTION 


Production of voice is influenced by the cognitive,
neurological and physical state of the speaker. In fact, voice
production depends on the precise interaction of many 
components including anatomical, physiological and 
neural aspects of the body. It is, therefore, not surprising 
that voice characteristics would be affected by a wide 
range of disorders and diseases. Hormonal imbalance, 
neurological disturbances, lung disease, and mental 
functioning can influence and often interfere with the 
ability to produce a clear and intelligible voice. 
Conversely, it should be possible to use acoustical 
analysis of signals generated by patients to assess the 
health and the mental state of the patient. 

Existing attempts at voice-based diagnosis have been
based on features which are only remotely connected to 
the physical characteristics of the vocal folds. We 
describe a new method to estimate vocal fold dynamics 
using a parametric model of glottis movements in order to 
assess the health of the vocal folds and detect 
pathological conditions of the larynx. This approach 
would ultimately enable clinicians to assess and diagnose 
individuals using only their vocalizations. Although the 
sensitivity and specificity of the diagnosis are likely to be 
limited, this is a very feasible approach for triaging 
individuals for further testing and treatment. We envision 
that in the future this diagnosis can be performed over the 
telephone. The analysis would therefore be conducted as
an unobtrusive exam and would contribute to the comfort
of the patient.


II. METHODS 


Our general approach to the diagnosis of larynx
pathologies consisted of two phases: (1) estimation of the
vocal tract transfer function H(ω) and the pitch F0, and
(2) estimation of the parameters of the best-fitting glottal
pulse generating model. This approach is similar to
previous work on the characterization of voice quality
using parameters of the Fant model [1]. The vocal
tract transfer function was estimated assuming that the
frequency distribution of the glottal pulse within the
relevant region was approximately constant. Given this
estimate, we found the parameters of a mathematical
model that would maximize the correspondence between
the observed utterances and synthetically generated ones
filtered by H(ω). In particular, the
best-fitting set of model parameters was estimated
for each subject’s data by maximizing the correlation
between the computed and the actual signals prior to the
lip transformation. The optimization was performed using
the Nelder-Mead simplex search method, because of the
complex error surface due to nonlinearities,
discontinuities and the complex interactions among the
model parameters. The block diagram of the parameter
estimation process is shown in Fig. 1.
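The fitting step can be illustrated with a minimal sketch: a correlation objective maximized by a small hand-written Nelder-Mead simplex search. Everything here is an assumption made for illustration. The two-parameter `toy_pulse` is a stand-in and is not the FL model, and the function names, starting point and iteration count are hypothetical.

```python
import math


def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def toy_pulse(params, n=200):
    """Toy skewed half-sine pulse (NOT the FL model): open quotient and skew."""
    open_q, skew = params
    if not (0.05 <= open_q <= 1.0 and 0.2 <= skew <= 5.0):
        return None  # outside the region where the toy model is defined
    return [math.sin(math.pi * ((i / n) / open_q) ** skew) if i / n < open_q else 0.0
            for i in range(n)]


def nelder_mead(f, x0, step=0.1, iters=150):
    """Minimal Nelder-Mead minimiser: reflect, expand, contract, shrink."""
    n = len(x0)
    simplex = [list(x0)] + [[v + (step if i == j else 0.0) for j, v in enumerate(x0)]
                            for i in range(n)]
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        refl = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(refl) < f(best):
            exp = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = exp if f(exp) < f(refl) else refl
        elif f(refl) < f(simplex[-2]):
            simplex[-1] = refl
        else:
            con = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(con) < f(worst):
                simplex[-1] = con
            else:  # shrink every vertex towards the best one
                simplex = [best] + [[b + 0.5 * (v - b) for b, v in zip(best, p)]
                                    for p in simplex[1:]]
    return min(simplex, key=f)


observed = toy_pulse((0.6, 1.5))  # synthetic "observed" signal


def objective(params):
    """Negative correlation, so that minimising it maximises the fit."""
    model = toy_pulse(params, len(observed))
    return 1.0 if model is None else -correlation(observed, model)


start = [0.5, 1.0]
fitted = nelder_mead(objective, start)
```

A derivative-free simplex search suits this kind of objective because it only needs function values, which matters when the error surface has discontinuities such as the parameter bounds handled by the penalty value above.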

Fig. 1. Block diagram of the parameter estimation
process.

In this study we report the results based on the
Fujisaki-Ljungqvist (FL) model [2]. The glottal flow and
its derivative are represented in this model by polynomial
segments. The choice of a polynomial model provides a
convenient way to vary the number of parameters, which
is useful for evaluating their relative importance. In its
most elaborate form, the model has three timing
parameters controlling the open phase duration, the pulse skew
and the time interval D from glottal closure to maximum
negative flow. In addition, there are three amplitude-related
parameters controlling the slope at glottal opening
A, the slope prior to closure B and the slope following
closure C. Although the offset parameter A (see Fig. 2)
has not been used in prior applications, we have included it
since a secondary excitation can often be observed at
glottal opening. The rounded closure that is often evident
in glottal flow waveforms is sometimes attributed to
a gradual glottal closure leaving a small residual flow
after the main excitation stops. We also included a
component attributable to the period of negative flow due
to the lowering of the vocal cords following glottal
closure. The mathematical representation of the glottal
flow in the FL model is given by:

Fig. 2. One period of the FL model showing the glottal flow
U_g(t), the glottal flow derivative E(t) and its parameters.


We have evaluated three different methods of
obtaining the estimate of the glottal pulse from the
speech signal:

1) Estimation of the vocal tract using the LPC
coefficients [3], based on the questionable assumption that
the glottal pulse is a pulse train with a uniform spectrum.
The LPC order was set to 32 (for a
sampling frequency of 16 kHz), sufficiently high to
characterize the vocal tract and yet sufficiently low not
to alter the shape of the glottal pulses.

2) Cepstral [4], in which the vocal tract estimation is
based on the notion that the frequency ranges of the vocal
tract filtering action and the glottal forcing functions do
not overlap. This method uses homomorphic filtering,
whereby the multiplication of the transfer functions is
transformed into an addition by the
logarithmic transformation. In particular, this method is
based on cepstral filtering (liftering).

3) Iterative Adaptive Inverse Filtering [5], where the
vocal tract transfer function is estimated by minimizing
the contribution of the average glottal pulse. This method
iterates two phases. The first phase generates an estimate
of the glottal excitation, which is subsequently used as
input to the second phase, which generates a more accurate
estimate. Typically, the inverse filtered signal is no
longer than a couple of hundred milliseconds to
ensure minimal changes in the vocal tract transfer
function.
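As an illustration of the first method, the sketch below estimates LPC coefficients with the autocorrelation method and the Levinson-Durbin recursion, then inverse filters the signal with the resulting all-zero filter. The choice of this particular estimation procedure is an assumption, since the text does not say how the coefficients were computed, and the function names are hypothetical. The demonstration uses order 2 on a synthetic one-pole signal rather than order 32 on 16 kHz speech.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns a with a[0] = 1 and the error power."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def inverse_filter(x, a):
    """Apply A(z) = sum_k a[k] z^-k to x, removing the modelled resonances."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


# Synthetic test signal: impulse response of a one-pole filter with pole at 0.9.
signal = [0.9 ** n for n in range(200)]
coeffs, _ = levinson_durbin(autocorr(signal, 2), 2)
residual = inverse_filter(signal, coeffs)
```

For this synthetic signal the residual collapses to the original impulse, which is the idealized version of what inverse filtering aims for on speech: what remains after removing the vocal tract model is attributed to the glottal excitation.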

The evaluation process involved estimating the
model parameters, constructing feature vectors and using
them to classify the voice samples. The
feature vector included the maximum value of the
correlation function, the amplitude-normalized
parameters of the model (A, B, C) and the temporal
parameters (R, F, D) multiplied by F0. These features
were estimated using three 32 ms windows
taken at 100 ms, 200 ms and 300 ms
from the beginning of each subject’s utterance. The classification
was implemented with a feed-forward back-propagation
network trained by gradient descent. The
topology of the neural network comprised one input
layer, one layer of hidden units and one output layer. A
separate network was used for each estimation technique,
with a number of inputs that depends on the model, m hidden units
and 1 output unit. The number of hidden units is in the
range of 5-54. This neural network approach was chosen
because of its computational efficiency, performance and
simplicity.
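The framing and feature construction described above can be sketched as follows. The helper names `frame_at` and `feature_vector` are hypothetical, the amplitude parameters are assumed to be already normalized, and the numeric parameter values are made up for illustration.

```python
def frame_at(signal, rate_hz, start_ms, length_ms=32):
    """Return the length_ms window that starts start_ms into the utterance."""
    start = rate_hz * start_ms // 1000
    length = rate_hz * length_ms // 1000
    return signal[start:start + length]


def feature_vector(max_corr, amp_params, time_params, f0_hz):
    """Max correlation, normalized (A, B, C), and (R, F, D) scaled by F0."""
    a, b, c = amp_params
    timing = [t * f0_hz for t in time_params]  # seconds times Hz: dimensionless
    return [max_corr, a, b, c] + timing


rate = 16000                 # sampling rate after down-sampling, from the text
utterance = list(range(8000))  # placeholder samples for half a second of signal
frames = [frame_at(utterance, rate, ms) for ms in (100, 200, 300)]

# Made-up parameter values for one frame.
vec = feature_vector(0.97, (0.2, -0.5, 0.1), (0.004, 0.002, 0.0005), 125.0)
```

Multiplying the timing parameters by F0 expresses them as fractions of the pitch period, so speakers with different pitch become comparable.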


III. RESULTS


In order to evaluate this approach, we used the Kay
Elemetrics Disordered Voice Database [6], which comprises
over 1,400 voice samples from approximately 700 subjects
and includes sustained phonation and running speech
samples from patients with a wide variety of organic,
neurological, traumatic, and psychogenic voice disorders,
as well as from 53 normal speakers. We used only
utterances of the sustained vowel /a/. In addition,
we used the Korean Disordered Speech Database [7], which
consists of 28 benign and 31 malignant pathological
speakers and 41 normal speakers. This database was
collected using the Kay Elemetrics database as a
template. The utterances in this database are the vowels /a/,
/e/, /i/, /o/, /u/. Again we used only the vowel /a/. The
sampling frequency and the bit resolution are the same as
in Kay Elemetrics. However, we down-sampled all
the data to 16 kHz for both databases. In this paper we
describe the classification results obtained with the two
databases combined.

Fig. 3. Maximum value of correlation for FL models with
respect to the inverse filtering method.


Since the classification is based on the correspondence
between the models and the data, we first present the
frequency distribution of the correlations between the
model and the data, shown in Fig. 3. The graph shows the
distribution of the maximum values of correlation for the
model and each inverse filtering method. Although
this model generally fits a large proportion of the
speakers, there was a small number of cases with only
a marginal fit to the model. This was mostly due to the
effects of the pathology on the glottal signal generation
process. Examples of the ability of the model to fit
pathological speakers are shown in Figs. 4-6.

Fig. 4. Signal fit for a pathological speaker, solid line is
real speech and dashed line is re-synthesized speech from
the model (max of correlation value is 0.972).

Fig. 5. Signal fit for a pathological speaker, solid line is
real speech and dashed line is re-synthesized speech from
the model (max of correlation value is 0.932).


[Plot for Fig. 6: amplitude versus time [ms]]


Fig. 6. Signal fit for a pathological speaker, solid line is 
real speech and dashed line is re-synthesized speech from 
the model (max of correlation value is 0.804). 


The classification was performed using feed-forward 
neural networks trained individually for each type of 
diagnosis. In order to prevent over-fitting, we used a 
cross-validation approach to train the classifier [8]. The 
results on the test sets are shown in Fig. 7.
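A cross-validation split of the kind mentioned above can be sketched in a few lines. This is a generic k-fold partition under stated assumptions: the text does not give the number of folds or the assignment scheme, so `k_fold_indices`, the interleaved assignment and the value k = 5 are all hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=10, k=5))
```

Each network would be trained on the `train` indices of a split and scored only on the held-out `test` indices, so the reported accuracy is never measured on samples used for learning.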


IV. DISCUSSION 


The results of the binary classification process are
shown in Fig. 7 in terms of the proportion of correct
discrimination between pathological and healthy speakers
(sensitivity and specificity). We used a confusion
matrix to determine the accuracy of the methods. In each
case the neural network was trained for the binary
classification of a specific pathology vs. normal. The
resulting performance of the glottal pulse model in
conjunction with the simple neural network classification
process is commensurate with many clinical tests.
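For reference, sensitivity, specificity and accuracy follow from the binary confusion matrix as sketched below. The labels are made up for illustration (1 = pathological, 0 = normal) and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived rates for labels 1 (pathological) / 0 (normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # pathological speakers correctly detected
        "specificity": ratio(tn, tn + fp),  # normal speakers correctly passed
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Made-up labels: four pathological and four normal speakers.
metrics = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
```

Reporting sensitivity and specificity separately matters for triage, because a method can reach high accuracy on an unbalanced test set while still missing many pathological speakers.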

In the case of “A-P squeezing” and “A-P squeezing
(mild)” we found that the mild case of this
pathology yields worse accuracy than the fully
developed pathology. This is consistent with general
findings: a mild case of a disease is closer to the
healthy state and is therefore harder to recognize as a
disease.

Fig. 7. Accuracy of pathology detection using FL model and all three inverse filtering methods.

Also results for the case of “pathological voice — 
diagnosis N/A” would confirm that this approach is 
suitable for general detection of the pathologies since 
the group consists of a variety of pathological voices
without known diagnosis. 


V. CONCLUSION 


These results suggest that this method has the 
potential to triage pathologies of the human voice and, 
moreover, to relate the parameter values to the state 
of the speech generation mechanisms. The average 
detection accuracy across pathological and normal 
voices was 88.7% for the LPC method, 90.83% for the 
cepstral method and 92.42% for the IAIF method. The 
best average accuracy was thus achieved using the FL 
model with the IAIF method. 

Abstract: Cough is a symptom and a central element 
in the diagnosis of very common respiratory diseases 
that cause death and loss of productivity in intensive 
pig farms. The aim of this research is to compare the 
acoustic features of cough sounds originating from 
infectious and non-infectious causes. The acoustic 
parameters investigated are the peak frequency and the 
duration of cough signals. The differences resulting 
from the sound analysis confirmed variability in 
acoustic parameters according to the state of health 
or disease. Infections change the status of the 
respiratory system; thus infectious cough (I) sounds 
differ from healthy ones (H). The duration of single 
coughs is significantly different among the classes 
analyzed: non-infectious coughs (H), Actinobacillus 
(A) coughs and Pasteurella (P) coughs. Frequency 
analysis allows a more general classification between 
H and I. Sounds can be used in an alarm system based on 
an algorithm that automatically identifies cough 
sounds and provides early warning to the farmer about 
the health status of his herd. 
Keywords: cough, diseases, prevention, sound 


I. INTRODUCTION 


Respiratory pathologies are frequent in pig 
husbandry and cough is their principal symptom. The 
importance of coughing as a prognostic indicator has 
been shown, since pig vocalisation is directly related 
to pain, and classification of such sounds has been 
attempted [6]. In this regard, there have been studies 
to identify the characteristics of coughs in pigs and 
to identify them automatically in field recordings for 
diagnostic purposes [1, 11, 12, 13, 14]. 

The following analysis considers databases of 
coughs collected in both field and laboratory 
conditions. Two types of cough were infectious, caused 
by multifactorial respiratory diseases mainly due to 
Actinobacillus pleuropneumoniae and Pasteurella 
multocida; the third type of cough was chemically 
induced in laboratory conditions. Actinobacillus 
pleuropneumoniae is considered a main primary 
bacterial agent of pleuropneumonia [9], whilst 
Pasteurella multocida is the most important 
secondary one [2, 8]. 

Actinobacillus pleuropneumoniae causes 
pleuropneumonia, which is currently a widespread 
problem in intensive pig farming. It interacts with 
Mycoplasma, Arterivirus or PCV-2. Pasteurella 
multocida is an opportunistic invader and the main 
cause of pulmonary pasteurellosis, which is often 
associated with Herpesvirus (PRV), Arterivirus and 
Mycoplasma hyopneumoniae. It is also a cause of 
progressive atrophic rhinitis, an economically 
significant problem in farms worldwide. These diseases 
typically range from a drop in production to slow 
death with progressive decay; prevention with 
strategic medical treatments is often ineffective, and 
its costs often exceed its benefits. 

The aim of this work is, by comparing I and H, to 
improve the labelling (classification) of recorded 
coughs by assigning physical values to specific sounds. 
These values will be used as inputs to an automatic 
alarm system, based on an algorithm that will 
recognize cough sounds from an installation in a farm 
and provide early warning to the farmer on the welfare 
status of his herd. 


II. MATERIALS AND METHODS 
A. Animals 


Infectious coughs (I) were collected in the 
fattening compartments of two affected pig farms. Both 
farms served Parma ham production and each hosted 200 
animals divided among 10 to 16 barns. The floor was 
fully slatted and liquid feeding was used. The 180 
Pasteurella-sick pigs (40 kg) were a hybrid strain, 
Landrace x LW + Danish Duroc boar. The serologic 
diagnosis (isolation in pure culture) and the 
necropsy results (hypertrophic lung section with 
blank areas, necrotic foci and fibrinous pleurisy) 
confirmed a pneumonia due to Pasteurella multocida 
associated with other infectious agents. The 200 pigs 
suffering from infection due to A. pleuropneumoniae 
(26-35 kg) were an Italian Landrace x Large White x 
Duroc cross. The necropsy showed haemorrhagic and 
necrotic lung lesions. Other concurrent infections 
were also present. 

Non-infectious coughs (H) were induced by inhalation 
of citric acid (0.8 mol/l citric acid dissolved in a 
0.9% NaCl solution) in six Belgian Landrace x Duroc 
piglets (20-40 kg) free from respiratory diseases. 
These sounds were recorded in laboratory conditions 
(for more information on this installation and the 
data acquisition process, see [7]). 


B. Sound analysis 


For the acquisition of I sounds, seven microphones 
(Monacor ECM 3005) with a frequency response of 
50-16000 Hz were used, connected via preamplifiers 
(Monacor SPR-6) to an 8-channel Soundscape unit 
(SS8IO-3). The Soundscape unit, which allows 
simultaneous recording, was connected via a TDIF cable 
to a PCI audio card (Mixtreme 192). All recordings 
were sampled at 44.1 kHz with a resolution of 16 bit. 
All microphones were hung in the stable. 

H coughs were caused by a temporary irritation of 
the upper respiratory tract: stimulation of the cough 
receptors directly resulted in coughing. In contrast, 
I coughs were caused, in the P case, by a deep 
bacterial infection of the lungs, since the infectious 
process starts at the alveolar-bronchiole junction and 
produces exudates, and, in the A case, by lung and 
pleural lesions with large red-blue areas in the upper 
diaphragmatic lobes and an overlying pleurisy. 

The characteristics of the cough sounds were 
identified in both the time and the frequency domain. 
The signal from the microphone was band-pass filtered 
between 100 Hz and 10800 Hz to remove low-frequency 
noise. Healthy and sick cough sounds were compared by 
considering the duration of the signal and the energy 
in the frequency content. The duration of a single 
cough, the number of hits and the time between the 
coughs in a cough attack were considered, as 
illustrated in Fig. 1. 


Fig. 1: Pig cough attack (14 hits shown) represented 
in the time domain (above; amplitude in dB versus time 
in s) and in the frequency domain (below; frequency in 
Hz versus time in s). The arrows indicate the 
parameters studied: a) length of a cough, b) time 
between two coughs, c) total length of the cough attack. 

These parameters were measured by the operator 
through auditory and visual inspection of the sound 
spectrum using Adobe Audition. For every cough signal 
the peak frequency (frequency of maximal energy 
content) was calculated. An analysis of variance was 
performed on the length of both single coughs and 
cough attacks among the three classes of coughs to 
evaluate the interclass distinction in the time and 
frequency domains. Adobe Audition 1.5 was used for 
recording and labelling the cough sounds in both 
laboratory and field, Matlab 7.1 for the signal 
processing, and the SAS statistical package (2004) for 
the statistical analysis. 
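
The peak frequency described above is the frequency at which the spectrum of a cough signal carries the most energy. The following Python sketch illustrates the idea with a naive discrete Fourier transform restricted to the 100-10800 Hz analysis band. It is an illustration only, not the authors' Matlab processing; the function name, the frame length and the synthetic test signal are assumptions.

```python
import cmath
import math


def peak_frequency(samples, sample_rate, f_min=100.0, f_max=10800.0):
    """Return the frequency (Hz) of the strongest DFT bin within [f_min, f_max]."""
    n = len(samples)
    best_freq, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n
        if not (f_min <= freq <= f_max):
            continue
        # Naive DFT of bin k; O(n^2) overall, acceptable for a short frame.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq


# Synthetic frame (assumed): a strong 900 Hz tone plus a weaker 3000 Hz tone,
# sampled at 44.1 kHz. A 441-sample frame gives a 100 Hz bin spacing.
fs = 44100
frame = [math.sin(2 * math.pi * 900 * i / fs)
         + 0.3 * math.sin(2 * math.pi * 3000 * i / fs)
         for i in range(441)]
print(peak_frequency(frame, fs))
```

For this synthetic frame the strongest bin lies at 900 Hz. A real recording would normally be windowed and analysed with an FFT routine for speed.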


III. RESULTS 


During the recording sessions we collected 
851 coughs from pigs affected by P and 186 coughs 
from pigs affected by A, coming from 91 and 26 cough 
attacks respectively. 

The average number of coughs in a cough attack 
was 13 for H, 9 for P and 7 for A (Table 1). 


Table 1. Number of cough attacks and single coughs in 
the collected database. 

Type of | Nr. of  | Nr. of | Min. nr. coughs | Max. nr. coughs | Mean nr. coughs 
cough   | attacks | coughs | in attack       | in attack       | per attack 
H       | 11      | 149    | 4               | 22              | 13.54 
P       | 91      | 851    | 5               | 25              | 9.35 
A       | 26      | 186    | 3               | 19              | 7.15 
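
The mean number of coughs per attack in Table 1 can be cross-checked from the reported totals. The short Python sketch below recomputes it from the number of attacks and the number of single coughs; the variable names are assumptions, and the counts are the ones listed in the table.

```python
# Counts as reported in Table 1: (number of attacks, number of single coughs).
table1 = {"H": (11, 149), "P": (91, 851), "A": (26, 186)}

# Mean number of coughs per attack for each class.
mean_coughs = {label: coughs / attacks
               for label, (attacks, coughs) in table1.items()}

for label, mean in mean_coughs.items():
    print(f"{label}: {mean:.2f} coughs per attack")
```

The recomputed means agree with the tabulated values to within rounding in the second decimal place.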


The results are summarised in Tables 1 and 2. 
The comparison against the database of H coughs 
first considered the duration of the sounds. 


Table 2. Duration of cough attacks and single coughs, 
and standard deviation (SD) of the duration of single 
coughs. 

Type of | Mean duration | Mean duration    | SD single 
cough   | attack (s)    | single cough (s) | coughs 
A       | 5.17          | 0.53             | 0.70 
H       | 8.61          | 0.43             | 0.13 
P       | 6.77          | 0.67             | 0.2 


Concerning the differences in length among the three 
classes of single coughs and attacks, the analysis of 
variance (SAS, GLM) shows highly significant 
differences among the classes (P<0.001). For the 
duration of the cough attacks, the difference is 
significant between H and A (P<0.0387) and between A 
and P (P<0.0493), but not between H and P (P<0.3418). 
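
The analysis of variance reported above was run in SAS (GLM). As a rough illustration of what a one-way ANOVA computes, the Python sketch below derives the F statistic from the between-class and within-class sums of squares. The function name is an assumption, and the durations are invented toy values that are not study data; converting F into a P value additionally requires the F distribution, which is omitted here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy single-cough durations in seconds (invented for illustration only).
durations_h = [0.41, 0.43, 0.45]
durations_a = [0.50, 0.53, 0.56]
durations_p = [0.64, 0.67, 0.70]
f_stat, df_b, df_w = one_way_anova_f([durations_h, durations_a, durations_p])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F value means that the class means differ by more than the spread within each class would explain.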

The analysis of the peak frequency of single 
coughs shows that lung diseases lower the peak 
frequency of the cough. There is a significant 
difference between the peak frequency of A coughs and 
that of H coughs. The peak frequency of H coughs 
ranges between 750 Hz and 1800 Hz; for P and A it 
ranges between 200 Hz and 1100 Hz (Table 3). The peak 
frequencies of P coughs are clearly lower than those 
of H coughs (H vs. P: P = 0.0062, significant), 
whereas the difference between P and A coughs is 
weaker (P = 0.0694) (Table 3; Fig. 2). The difference 
between H and A coughs is also highly significant 
(P = 0.00002). 


Table 3. Peak frequency among the three classes of 
single coughs. 

Type of cough | Peak frequency range 
A             | 200-1100 Hz 
H             | 750-1800 Hz 
P             | 200-1100 Hz 


Fig. 2: Boxplot of the peak frequency of the three 
classes of analysed coughs. The representation shows 
the values obtained from the frequency analysis 
divided into quartiles. The rectangle contains the 
middle 50% of the distribution and the horizontal line 
is the median. The two sick coughs differ from the 
healthy one in having a lower mean peak frequency. 


IV. DISCUSSION and CONCLUSIONS 


The possibility of distinguishing between 
pathological and healthy cough sounds by physical 
sound features has been shown. As this work improves 
the characterisation, in terms of acoustic parameters, 
of coughs caused by specific agents, it will be useful 
for improving cough sound labelling, since it provides 
significant differences between coughs arising from 
infected and non-infected animals. Past literature has 
already focused on this distinction, but specifically 
in humans. Van Hirtum and Berckmans have already shown 
several ways to work with pig cough, from the 
assessment of the cough with respect to vocalization 
[11], through the automated recognition of spontaneous 
versus voluntary cough [12], to the recognition of 
cough sounds by an algorithm in laboratory conditions 
[14]; however, literature on the acoustic features of 
different respiratory diseases is still lacking. In 
this paper, the sound analysis considers features such 
as frequency energy content and duration of cough. 

In terms of the peak frequency of the cough signal, 
sick coughs show a significantly lower peak frequency 
than healthy coughs (200-1100 Hz for I and 750-1800 Hz 
for H). This is incongruous with the findings of 
Korpas et al., who state that frequencies of 300 Hz to 
500 Hz are the most prominent in healthy human coughs, 
whereas in cough sounds of bronchitis the bands 
between 500 Hz and 1200 Hz are the most prominent [5]. 
Differences in cough sounds between humans and pigs 
can be explained by differences in the amount of air 
pushed through the trachea or by the dimensions and 
characteristics of the trachea itself. On the other 
hand, Van Hirtum and Berckmans [14] and Ferrari et al. 
[4] showed that the fundamental frequency of 
non-infectious pig cough sounds in laboratory 
conditions is higher than that of infectious coughs; 
our study in field conditions confirms their results. 

When considering the duration of a single cough, 
there is a significant difference between the two 
groups of cough sounds: a mean duration of 0.53-0.67 s 
was observed for A and P, versus 0.43 s for H. This 
leads us to consider the length of these signals as a 
tool to distinguish the sounds. The trend was also 
observed by other authors, who concluded that 
infectious coughs last longer than non-infectious ones 
because of airway obstruction by infection and 
inflammation [5, 11], in both sick humans and pigs. 
Concerning the duration of a single cough or of a 
cough attack as a whole, nothing was found in the 
literature. Further analysis should be done to clarify 
these findings. Although a connection between the time 
and frequency domain characteristics and the physical 
system parameters for pig vocalizations is not yet 
known, the present results indicate that such a 
connection exists and remains to be determined. By 
understanding the effect of respiratory airway 
inflammation and structural changes of its cell walls 
on cough sounds, information can be extracted about 
the status of the animals. In field situations this 
can lead to a useful acoustic monitoring system. The 
acoustic features characterizing a sick cough can be 
used as inputs for on-line cough-counting algorithms. 

It is suggested that the present application, 
integrated in an automatic detection system, can be 
used to monitor animal health continuously and might 
help to advance animal welfare in pig houses, where 
the high number of animals makes control difficult. 
This automatic approach can save medical costs and 
supply information on how to face, in terms of 
biosecurity, the prevention and spread of respiratory 
pathologies, especially unavoidable diseases like the 
multifactorial ones in intensive farms. 

Dunlop [3] and Stevens and [10] stated that 
approximately 62% by weight of the antimicrobials 
have been concerned for several years about the large- 
scale use of in-feed antimicrobials at subtherapeutic 
levels in food animal production [15]. The potential 
risks include chemical residues in meat and the 
development of resistance to commonly used 
antimicrobials by bacteria important in human 
medicine. As a result, the pig industry and the 
regulatory bodies are attempting to limit the use of 
antimicrobials and encouraging improved biosecurity, 
management practices and vaccination policies in pig 
units. 

Modern pork production is searching for a variety of 
tools to ensure the health, welfare and productivity 
of pigs. Considering the uncertainty surrounding the 
use of antibiotics, a new prevention tool such as 
sound analysis looks promising. Sound analysis in 
field conditions provides additional, non-invasive 
quantitative information and is a candidate for 
developing an automatic on-line health monitoring tool. 

Abstract: This paper presents predictions of the 
consequences of tongue surgery on speech 
production. For this purpose, a 3D finite element 
model of the tongue is used that represents this 
articulator as a deformable structure in which the 
anatomy of the tongue muscles is realistically 
described. Two examples of tongue surgery, which are 
common in the treatment of cancers of the oral 
cavity, are modelled, namely a hemiglossectomy 
and a large resection of the mouth floor. In both 
cases, three kinds of possible reconstruction are 
simulated, assuming flaps with different stiffness. 
Predictions are computed for the cardinal vowels 
/i, a, u/ in the absence of any compensatory 
strategy, i.e. with the same motor commands as 
those associated with the production of these vowels 
in non-pathological conditions. The estimated 
vocal tract area functions and the corresponding 
formants are compared to the ones obtained under 
normal conditions. 

Keywords: biomechanical modelling, tongue 
surgery, glossectomy, speech production 


I. INTRODUCTION 


Resection surgery can be required in the case of a 
cancerous tongue tumour or for particular pathologies 
such as macroglossia, characterized by an abnormally 
voluminous tongue. In the case of a noticeable loss of 
bulk or volume, the tongue is reconstructed using a 
local or distant flap in order to limit the functional 
consequences; the choice of flap is still a debated 
question. 

The surgical procedure can impair tongue mobility 
and tongue deformation capabilities, which can degrade 
three basic functions of human life, namely 
mastication, swallowing and speech. The consequences 
of surgery can then induce a noticeable decrease in 
the patient's quality of life. The current project 
aims at developing software that would allow surgeons 
to predict the consequences of a tongue resection for 
a given patient, using a 3D biomechanical model of the 
oral cavity combined with a synthesizer based on the 
vocal tract area function. So far, the model has been 
tested for two common exeresis schemes for a 
particular subject. In this paper, we first briefly 
introduce the model used for this study and the 
implementation followed for two glossectomies 
(resection and reconstruction). Then we present the 
results obtained for the cardinal vowels /i, a, u/ in 
terms of formant deviations and tongue mobility, 
compared to the non-pathological case. 


II. METHODS 


A 3D biomechanical tongue model 

The 3D biomechanical model of the oral cavity used 
in this study was originally designed by Gérard et al. 
[1] and was further enhanced for speech production 
control [2] (Fig. 1). The tongue and the hyoid bone 
are represented by mobile 3D volumetric meshes, 
while the jaw, teeth, palate, and pharynx are modelled 
by static surface elements describing the oral cavity 
limits with which the tongue interacts due to 
mechanical contacts. 


Figure 1: 3D model of the tongue in the midsagittal plane 
(apex on the left). 


Modelling tongue resections 

To model a surgical resection followed by a flap 
reconstruction, the muscle fibres located in the 
resected area are removed and the biomechanical 
properties of the corresponding elements are modified 
to account for the elastic properties of the flap. 
Three values of flap stiffness are considered: 
identical to that of the passive tissues, 5 times 
smaller or 6 times higher. In addition, since little 
is known about the force generation capabilities of 
muscles that have been partially shortened, three 
options were tested for the activation of sectioned 
fibres: 1) no activation, 2) low activation or 3) a 
level of activation similar to that in the normal 
case. Additional details about our general modelling 
approach can be found in [3]. 
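
The reconstruction options described above define a small grid of simulation conditions. The Python sketch below enumerates such a grid for one surgery; it is a hypothetical bookkeeping aid, not part of the biomechanical model. The dictionary keys are assumptions, the stiffness factors follow the values quoted in the Figure 3 caption (0.2, 1 and 5 times the stiffness of passive tongue tissues), and whether every combination was actually simulated is not stated in the text.

```python
from itertools import product

# Flap stiffness relative to passive tongue tissue (values quoted in the
# Figure 3 caption).
stiffness_factors = [0.2, 1.0, 5.0]
# Hypotheses for the activation of sectioned fibres.
activation_options = ["none", "low", "normal"]
# Cardinal vowels studied.
vowels = ["i", "a", "u"]

# One entry per (vowel, flap stiffness, activation hypothesis) combination.
conditions = [
    {"vowel": v, "stiffness_factor": s, "sectioned_fibre_activation": a}
    for v, s, a in product(vowels, stiffness_factors, activation_options)
]

print(len(conditions), "conditions per simulated surgery")
print(conditions[0])
```

Listing the conditions explicitly makes it easy to see which factor (vowel, flap stiffness or activation hypothesis) changes between two simulations being compared.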

The first simulated surgery corresponds to a left 
hemiglossectomy (Fig. 2, right panel). The left part of 
the styloglossus is removed, as well as the left 
anterior parts of the longitudinal muscle, of the 
transversalis and of the verticalis, and the upper 
part of the left hyoglossus. The medium and anterior 
parts of the left genioglossus are nearly entirely 
removed, whereas its posterior part is only partially 
affected. 

The second simulated surgery corresponds to a large 
mouth floor resection (Fig. 2, left panel). In that case, 
the mobile tongue is totally preserved. The anterior 
part of the genioglossus is removed, as well as the two 
major muscles of the mouth floor, namely the 
geniohyoid and the mylohyoid, in their entirety. 


Figure 2: Left: modelling of a mouth floor resection; right: 
modelling of a left hemiglossectomy 


Motor control of the model 

The tongue model is deformed and controlled by a 
functional model of muscle force generation 
mechanisms, namely the Equilibrium Point 
Hypothesis [4]. Motor commands were first inferred 
for the original structure for the three studied 
vowels, and simulations were then carried out for the 
various surgery conditions with these original 
muscle motor commands held for 200 ms. The selection 
of motor commands was based on considerations of the 
tongue shapes in the mid-sagittal plane [5] combined 
with published EMG data [6][7]. 


From tongue shapes to acoustic properties 

The final tongue surface was interpolated by natural 
cubic spline curves. Then, intersections between the 
different articulators and a 3D semi-polar grid were 
computed to estimate the vocal tract area function. 
The associated formants were finally computed and 
compared with each other. 


III. RESULTS 


A. Impact of a left hemiglossectomy 
Only the results for the second case (intermediate 
level of activation for the sectioned fibers) are 
presented, most fibers being either intact or fully 
removed after resection. 


(a) Impact on the tongue mobility 
After a hemiglossectomy, we noticed a large 
deviation of the apex, either towards the healthy 
tissue side for vowels /u/ and /i/ (Fig. 3) or towards 
the flap side for vowel /a/, as well as a rotation of 
the apex. The size of the deviation differs between 
vowels according to the biomechanical properties of 
the flap. After reconstruction, the smaller the 
stiffness of the flap, the larger the asymmetry of the 
tongue shape. This is especially true for vowels 
/i, u/, due to the styloglossus activation, but also 
for vowel /a/, probably due to the combined activation 
of the anterior genioglossus and hyoglossus, two 
muscles slightly affected by the exeresis. In the case 
of vowel /a/, we also found a more pronounced 
flattening of the tongue with decreasing flap 
stiffness: a high stiffness flap restricts the tongue 
movements. 


Figure 3 : Impact of a left hemiglossectomy on the tongue 
symmetry for vowel /i/. (a): non pathological case, (b)-(d): 
reconstruction with flaps of increasing stiffness (0.2, 1 or 5 
times the stiffness of passive tongue tissues). 


(b) Impact on the acoustic signal 
Figures 4 and 6 show the variations of the first two 
formants associated with the different resections and 
reconstructions. 


Figure 4: F1/F2 formant patterns for a left 
hemiglossectomy for flaps with different stiffness (x-marks: 
small stiffness; crosses: medium stiffness; diamonds: high 
stiffness). Triangles join the extreme vowels obtained with 
the non-pathological model. 


A left hemiglossectomy (Figure 4) has a negligible 
impact on the production of vowels /a/ and /u/. For /i/ 
the formant deviation is larger, with an average 
increase of 8% for F1 and an average decrease of 9% 
for F2. In terms of formant changes, a softer flap 
seems to have less impact, particularly for /a/, but 
the differences between flaps are slight. These 
results are consistent with the variations observed in 
the tongue shapes. 
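
The percentage deviations quoted above are relative changes of a formant with respect to its value in the non-pathological simulation. The Python sketch below shows that calculation; the function name is an assumption, and the formant values in Hz are hypothetical numbers chosen only so that the example reproduces an 8% increase and a 9% decrease. They are not the simulated values.

```python
def relative_change_percent(reference_hz, simulated_hz):
    """Signed change of a formant relative to the reference, in percent."""
    return 100.0 * (simulated_hz - reference_hz) / reference_hz


# Hypothetical /i/ formants in Hz (not taken from the simulations).
ref_f1, ref_f2 = 300.0, 2200.0   # non-pathological model
sim_f1, sim_f2 = 324.0, 2002.0   # after simulated surgery

delta_f1 = relative_change_percent(ref_f1, sim_f1)
delta_f2 = relative_change_percent(ref_f2, sim_f2)
print(f"F1: {delta_f1:+.1f}%  F2: {delta_f2:+.1f}%")
```

Averaging such signed changes over the three flap stiffness conditions would give one figure per formant, which is one plausible reading of the "average" values reported here.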


B. Impact of a large mouth floor resection 

The anterior part of the genioglossus being resected, 
large discrepancies appeared between the three cases 
studied concerning the modeling of force generation 
in sectioned fibers. The simulations showed that the 
smaller the activity of the sectioned fiber, the more 
important the differences with the non pathological 
case. Since the implementation of the resection is 
done symmetrically, no rotation of the tongue was 
induced. 


(a) Impact on the tongue mobility 
The simulations revealed a large impact of mouth 
floor resection on tongue elevation and protraction 
movements for vowels /u/ and /i/. The mylohyoid 
muscle allows the stiffening of the mouth floor, which 
is essential to tongue elevation. Furthermore, the 
posterior genioglossus is the main muscle involved in 
protraction movement. Its partial resection limits the 
contraction of the anterior part of the tongue base. For 
vowel /a/, a high stiffness flap limits the tongue 
mobility and limits the flattening of the tongue. 
For the different vowels, a high stiffness flap seems 
the most appropriate choice. Figure 5 shows the results 
for vowel /i/ with the different reconstruction 
schemes and for the non-pathological case. A high 
stiffness flap favors the tongue protraction 
movements, whereas a small stiffness flap can lead to 
a total obstruction of the vocal tract. Similar results 
were observed for vowels /u/ and /a/ (reduction of the 
airway section in the pharyngeal area). 
The hypotheses made concerning the activation of the 
sectioned fibers lead to significant differences: 
obstruction or not of the vocal tract for vowel /u/, 
backward rotation of the apex for vowel /a/ in the 
absence of activation (the inactivated anterior 
genioglossus can no longer counteract the activation 
of the hyoglossus) and a more or less pronounced 
backward movement for /i, a, u/. Comparison of the 
simulation results with data collected on patients 
could shed light on which hypothesis (no activation, 
partial activation or full activation) is the most 
realistic. However, the choice of the activation did 
not change the effect of the flap properties on the 
tongue mobility. 

Figure 5: Shape of the tongue in the mid-sagittal plane after 
a mouth floor resection for vowel /i/ (mid level of activation 
for the sectioned fibers). The solid contour represents the 
non-pathological case, the dotted contour the reconstructed 
model with the small stiffness flap, the dashed contour the 
medium stiffness flap and the dash-dot contour the high 
stiffness flap. 

(b) Impact on the acoustic signal 

Figure 6 plots the first and second formants for vowel 
/i/ for the partial and full activation hypotheses. 

Results can be summarized as follows: 

• A large mouth floor resection seems to have 
severe consequences on speech production. For 
vowel /u/, keeping the motor commands inferred 
for non-pathological conditions leads to an 
obstruction of the vocal tract in the pharyngeal 
region, due to the resection of the anterior part of 
the posterior genioglossus, which counteracted the 
effects of the styloglossus activation before 
surgery. Therefore, no formant could be 
computed. 

• The current pattern of activation did not permit 
the production of the high front vowel /i/ (average 
increase of 23% for F1 and average decrease of 17% 
for F2), with large discrepancies according to 
the flap. A high stiffness flap leads to a larger 
increase of F1, whereas a small stiffness flap 
leads to a larger decrease of F2. 

• For vowel /a/, we observe a decrease in F1 
and F2, particularly for low stiffness flaps, 
corresponding to a deviation from vowel /a/ to 
vowel /o/. 

Combined with the observation of tongue shapes, our 
results show that, for mouth floor resection, a high 
stiffness flap should be favoured. Indeed, only this 
kind of flap can allow the tongue to reach a high 
front position close to /i/. 

Figure 6: F1/F2 formant patterns for a mouth floor resection for 
flaps with different stiffness (small stiffness represented by x- 
marks, medium stiffness by crosses and high stiffness by 
diamonds). Triangles join the extreme vowels obtained with the 
non-pathological model. Top panel: no activation; bottom panel: 
low activation for the sectioned fibres. 


IV. DISCUSSION 


Simulations with a realistic 3D biomechanical model 
could significantly improve systems for planning 
tongue surgery. In terms of changes in F1/F2 patterns, 
our results are in good agreement with 
measurements made on patients [8]. The role of the 
flap stiffness in tongue mobility could also be 
assessed and, interestingly, it differs between the 
hemiglossectomy and the mouth floor resection. 
Further improvements of the model include 
algorithmic aspects aiming at a significant decrease of 
the computation time, and mesh matching methods to 
design patient-specific oral cavity models. 

Abstract: The work presented in this paper was supported 
by the project number 2C06009. Verbal communication 
is the most obvious instrument used to express our 
thoughts and ideas. Considering only this part of 
speech, without regard to its nonverbal part, may lead 
to overlooking important information in an utterance or 
even to misunderstanding it. This paper deals with the 
use of an automatic system for the recognition of 
facial expressions, which is being created for 
the Czech dialog system. 

Keywords: Face detection, feature extraction. 


I. INTRODUCTION 


Understanding human emotions and their nonverbal 
messages is one of the most necessary and important 
skills for making the next generation of human-computer 
interfaces (HCI) easier, more natural and more effective. 
Indeed, the first step toward an automatic 
emotion-sensitive human-computer system able to detect 
users’ nonverbal signals automatically is the 
development of an accurate, real-time automatic analyzer 
of nonverbal communication (NVC). Such an analyzer must 
deal mainly with users’ facial expressions and 
paralanguage. 

Nonverbal communication has many functions in the 
communication process. By virtue of nonverbal 
communication, we simply express our emotions. In 
many cases we are able to exhibit our feelings by facial 
expression and gestures much more quickly than by using 
words. It regulates relationships and may support or 
replace verbal communication [1]. 

On the other hand, nonverbal communication has its 
disadvantages and downsides too. Difficulties may arise 
if communicators are unaware of the types of messages 
they are sending, and how the receiver is interpreting 
those messages. No dictionary can accurately classify 
nonverbal signals. Their meanings vary not only by 
culture and situation, but also by the degree of intention 
of their use. Many of them are ambiguous and could 
cause misunderstandings. Effective communication is the 
combined harmony of verbal and nonverbal actions. 
The main categories of nonverbal communication are 
facial expression, posture, gesture, proximity, gaze, 
paralanguage, touch and adornment. 


II. METHODS 


Facial expression carries most of our nonverbal 
meaning and is often considered the most important 
category of nonverbal communication (according to many 
experts, 55-85 percent of NVC is exchanged through 
facial expressions). Although the human face is capable 
of creating 250,000 expressions, fewer than 100 sets of 
them constitute meaningful symbols. Three main 
categories of conversational signals have been 
identified: syntactic displays, used to stress words or 
clauses (raising or lowering the eyebrows can be used to 
emphasize a word or clause); speaker displays, which 
illustrate the ideas conveyed (“I don’t know” can be 
expressed by the corners of the mouth being pulled up or 
down); and listener comment displays, used in response 
to an utterance (incredulity can be expressed with a 
longer duration of eyebrow raising). 

The goals of our research can be divided into the 
following topics: speech signal processing focused on 
speaker recognition, definition of speaker-dependent 
features suitable for speaker recognition, and automatic 
recognition of facial expressions. The first goal, 
speech signal processing focused on speaker recognition, 
was accomplished by proposing a voice activity detector 
(VAD) based on a neural network; the VAD achieves an 
error lower than 1%, which is a good result. The second 
goal was accomplished by defining a new set of 
speaker-dependent features, the Speaker Dependent 
Frequency Cepstrum Coefficients (SDFFCC). For the third 
goal, we have described the structure of the Automatic 
Recognition of Facial Expression (ARFE) system and have 
seen that its most important stages are the face 
localization, the Gabor wavelet representation of the 
facial image and the classification stage performed by 
AdaBoost. 
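
To make the Gabor wavelet representation stage more concrete, the Python sketch below builds the real part of a single 2D Gabor kernel from its usual definition (a Gaussian envelope multiplied by a cosine carrier) and applies it to one image patch. This is only an illustrative sketch: the function names and all parameter values (kernel size, wavelength, orientation, sigma, aspect ratio) are assumptions, and the ARFE system's actual filter bank is not specified in the text.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor filter as a size x size list of rows (size odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates to the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_response(patch, kernel):
    """Response of the kernel at one image location (element-wise product sum)."""
    return sum(p * k for prow, krow in zip(patch, kernel)
               for p, k in zip(prow, krow))


# Assumed parameters: 9 x 9 kernel, 4-pixel wavelength, horizontal carrier.
kernel = gabor_kernel(size=9, wavelength=4.0, theta=0.0, sigma=2.0)

# Toy patch: vertical stripes with the same 4-pixel period as the kernel.
patch = [[math.cos(2 * math.pi * (x - 4) / 4.0) for x in range(9)]
         for _ in range(9)]
print(filter_response(patch, kernel))
```

In a full system, responses from a bank of such kernels at several orientations and wavelengths would form the feature vector from which a boosting classifier such as AdaBoost selects and weights features.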


III. RESULTS 


The dataset consists of 45 adult volunteers and 15 
infants. None of the subjects wore eyeglasses. Some of 
the subjects had hair covering their foreheads; no 
subject wore a cap or had makeup on their brows, eyelids 
or lips. The subjects included both males (60%) and 
females (40%). An important condition was maximum 
illumination with a minimum of facial shadows. The 
initial idea was to ask each volunteer to look at some 
examples of all 6+1 facial expressions (happy, fear, 
anger, disgust, sadness, surprise and neutral) and try 
to copy them. Each expression was also to be repeated 
several times, with the best one chosen for the training 
set. Unfortunately, reality was different. During the 
data gathering, we learned that people are able to 
exhibit their neutral expression with little or no 
effort, and that they are able to simply smile for many 
minutes without any pause. But other expressions, such 
as fear, disgust or anger, are very difficult and maybe 
impossible for people without theatrical experience. 
Therefore a simple facial expression recording sometimes 
lasted more than half an hour for one person instead of 
the two minutes planned. As a result of all these 
problems, the training set gathered to date provides 
only three acceptable facial expressions: neutral, happy 
and surprise. Unfortunately, not all of the surprise 
expressions are really perfect surprise expressions 
either. The final training set contains 75 images, 25 
images per expression, from 60 volunteers. The 
significant role of facial expressions convinces us to 
use visual input to process and analyze them. The facial 
expression recognition problem can be divided into the 
following three partial problems: face detection, facial 
feature extraction and facial expression classification. 
Despite significant advances in computer vision in 
recent years, developing robust and accurate facial 
expression recognition in an automatic way and in real- 
time is still very problematic and at present belongs to one 
of the greatest dreams and most active areas in the 
computer vision. The system automatically detects frontal 
faces in complex backgrounds and makes classification 
for each found face (see Fig.1). The only requirements of 
the system are frontal faces, a good illumination condition 
and acceptable light direction. In other words, faces 
should not contain shadows and must be well lighted. 
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Fig.1 Experiment conditions 


After a face has been detected in an arbitrary image,
which may be a digitized video signal or a digitized
image, the face finder returns the coordinates of a
square box around the face.

Fig. 2 ARFE - static images mode


IV. DISCUSSION


One of the factors that lowers the accuracy of the
current version of ARFE is the face detection and
localization system it uses, which is provided by OpenCV.
ARFE is based on state-of-the-art approaches, is a
multi-user system (see Fig. 2) and has two working modes:
static images (photo) and dynamic images (video). This is
possible because each frame is processed and classified
separately. OpenCV provides excellent face detection.
Unfortunately, its face localization is not accurate
enough for an automatic facial expression recognition
system and clearly needs improvement [2]. This problem
could be solved by combining the present ARFE with a
local approach that deals directly with facial features.


V. CONCLUSION 


In this paper, we presented two variants of a new method
for automatic dialog act recognition based on word
clusters. A prototype of the dialog system is being
developed in the Department of Computer Science. The
proposed system is fully automatic, user-independent and
works in real time. First experiments show that speech
recognition quality is increased by using an automatic
facial expression recognition system [3]. The results
obtained are interesting and at least show that designing
a fully automatic facial expression system in a
constrained environment is possible today.

The work presented in this paper was supported by the
project under contract number 2C06009.


Abstract: This paper suggests the nonlinear parameter
of self-similarity as a novel feature for wavelet packet
based voice signal analysis. Two groups of voice signals,
normal and pathological, have been decomposed using
wavelet packets. Next, the self-similarity
characteristics of the reconstructed signal at each node
have been calculated, and the discrimination ability of
each node has been obtained using the Davies-Bouldin
criterion. The eight most discriminant nodes have then
been identified to construct the feature vector. To
reduce the feature vector dimensionality, principal
component analysis (PCA) has been employed. Finally, an
artificial neural network has been trained to classify
normal and pathological voices. The results show that the
self-similarity parameter can be a reliable feature in
wavelet packet based voice signal analysis. Moreover, the
selected sub-bands are distributed over the whole
available frequency range, which shows that pathological
factors do not affect one specific frequency range and
underlines the role of WP decomposition.


Keywords: vocal disorder, wavelet packet, self- 
similarity, Davies-Bouldin Criterion 


I. INTRODUCTION 


The vibration pattern of the vocal folds, excited by the 
air-flow through the glottis, is an important indicator of 
laryngeal function. In fact, any abnormality of the larynx 
will be evident in the glottal waveform and is reflected in the
audible quality of speech. Pathological voices are 
strongly corrupted with random variations of their 
features, which often assume the aspect of noise [1]. 

Unilateral vocal fold paralysis (UVFP) is caused by 
injury to the recurrent laryngeal nerve. Patients with 
UVFP may have significant impairment of vocal fold 
function, including a breathy paralytic dysphonia. UVFP 
most commonly occurs following a surgical iatrogenic 
injury to the vagus or recurrent laryngeal nerve. It results 
in glottal incompetence, either partial or complete, 
because of the poor or reduced vocal fold closure 
resulting in a weak and uncoordinated vocal fold 
vibration. Such irregularities in the pattern of vocal fold
vibration might induce pitch frequency fluctuations,
airflow volume changes, amplitude and mucosal wave
reduction, and also noise-like turbulence of the airflow
in the vicinity of the cords.

Physicians often use invasive techniques such as
endoscopy to diagnose symptoms of voice disorders. It is,
however, possible to identify disorders non-invasively
using certain features of the speech signal. Several
research groups have recently used wavelet packet based
feature extraction. Schuck et al. [2] have used Shannon
entropy and energy features of the wavelet packet
decomposition and the best basis algorithm for
normal/pathological speech signal classification. Fonseca
et al. [3] have employed mean squared values of
reconstructed signals in discrete wavelet transform
sub-bands and a least squares support vector machine
classifier to distinguish signals of patients with vocal
fold nodules from normal signals. Guido et al. [4] have
tried different wavelets in the search for voice
disorders; the Daubechies mother wavelet with support
length 20 (db10) was found to be the best wavelet for
speech signal analysis among commonly used mother
wavelets. Behroozmand et al. [5] have used a genetic
algorithm for optimal selection of wavelet packet based
energy and Shannon entropy features to identify speech
signals of patients with unilateral vocal fold paralysis
(UVFP). Their results showed that a decomposition level
of five is the most appropriate for analyzing
pathological speech signals. Umapathy et al. [6] have
used local discriminant bases (LDB) and wavelet packet
decomposition to demonstrate the significance of
identifying the signal subspaces that contribute to the
discriminatory characteristics of normal and pathological
speech signals.

Matassini et al. [7] have analyzed voice signals in a
feature space consisting of quantities from chaos theory
(such as the correlation dimension and the first Lyapunov
exponent) in addition to conventional linear parameters;
the nonlinear parameters were reported to separate normal
and sick voices clearly. Two nonlinear features,
recurrence period density entropy (RPDE) and fractal
self-similarity, have been studied by Little et al. [8]
for speech pathology detection, and it has been shown
that these two nonlinear measures, based parsimoniously
upon the biophysics of speech production, can be both
simple and robust, and are amenable to implementation as
online algorithms.

This work deals with the classification of normal and
pathological voice signals (herein, patients with UVFP)
using the following procedure: measurement of the
self-similarity parameter of the reconstructed signal at
each node of the WP decomposition, discriminant feature
selection based on the Davies-Bouldin index,
normalization of the feature vector data, dimension
reduction of the feature vector by means of PCA, and
finally implementation of an artificial neural network
for classification.

Following the introduction, methods and materials are 
reviewed in the next section. The results of this study are 
discussed in section III. Finally, section V presents the 
conclusions. 


II. METHODS

A. Wavelet Packet Transform


Recently, wavelet packets (WPs) have been widely 
used by many researchers to analyze voice and speech 
signals. There are many outstanding properties of wavelet 
packets, which encourage researchers to employ them in 
many widespread fields. It has been shown that the
sparsity of the coefficient matrix, computational
efficiency, and time-frequency analysis can be useful in
dealing with many engineering problems. Most importantly,
the multiresolution property of WPs is helpful in voice
signal synthesis.

The hierarchical WP transform uses a family of 
wavelet functions and their associated scaling functions 
to decompose the original signal into subsequent sub- 
bands. The decomposition process is recursively applied 
to both the low- and high-frequency sub-bands to generate
the next level of the hierarchy. WPs can be described by 
the following collection of basis functions [5]: 


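$$ W_{2n}(2^{p-1}x - l) = \sqrt{2^{1-p}} \sum_{m} h(m - 2l)\, \sqrt{2^{p}}\, W_{n}(2^{p}x - m) \qquad (1) $$

$$ W_{2n+1}(2^{p-1}x - l) = \sqrt{2^{1-p}} \sum_{m} g(m - 2l)\, \sqrt{2^{p}}\, W_{n}(2^{p}x - m) \qquad (2) $$

where p is the scale index, l the translation index, h the
low-pass filter and g the high-pass filter with

$$ g(k) = (-1)^{k}\, h(1 - k) \qquad (3) $$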


The WP coefficients at different scales and positions 
of a discrete signal can be computed as follows: 


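$$ C_{n,l}^{p} = \sqrt{2^{p}} \sum_{m} f(m)\, W_{n}(2^{p}m - l) \qquad (4) $$

$$ C_{2n,l}^{p-1} = \sum_{m} h(m - 2l)\, C_{n,m}^{p} \qquad (5) $$

$$ C_{2n+1,l}^{p-1} = \sum_{m} g(m - 2l)\, C_{n,m}^{p} \qquad (6) $$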


Fig. 1. Estimation of parameter α for a sample voice signal.

The basic reasoning behind wavelet packet based features
is that they can exploit and remove the information
redundancies that usually exist in the set of samples
obtained by the measuring devices. Wavelet packets also
take advantage of multi-resolution analysis.
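
As a rough illustration of the recursive low-pass/high-pass
splitting described above, the following sketch builds a full
wavelet packet tree in pure Python. It is only a sketch under
stated assumptions: it uses the two-tap Haar filters instead of
the db10 filters employed in the study, it derives the high-pass
filter from the usual relation g(k) = (-1)^k h(1 - k), and the
helper names `split` and `wavelet_packet` are hypothetical.

```python
import math

# Haar low-pass filter: an assumption made to keep the sketch short;
# the study itself uses the longer db10 filters.
H = [1 / math.sqrt(2), 1 / math.sqrt(2)]
# High-pass filter from the relation g(k) = (-1)^k h(1 - k).
G = [(-1) ** k * H[1 - k] for k in range(len(H))]


def split(coeffs, filt):
    """One branch of the recursion: out[l] = sum_m filt(m - 2l) * coeffs[m]."""
    out = []
    for l in range(len(coeffs) // 2):
        acc = 0.0
        for k, tap in enumerate(filt):
            m = 2 * l + k
            if m < len(coeffs):
                acc += tap * coeffs[m]
        out.append(acc)
    return out


def wavelet_packet(signal, levels):
    """Return the 2**levels terminal nodes of a full WP tree (natural order)."""
    nodes = [list(signal)]
    for _ in range(levels):
        next_nodes = []
        for coeffs in nodes:
            next_nodes.append(split(coeffs, H))  # low-frequency child
            next_nodes.append(split(coeffs, G))  # high-frequency child
        nodes = next_nodes
    return nodes


samples = [4.0, 2.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]
nodes = wavelet_packet(samples, levels=2)
# The Haar transform is orthonormal, so the energy is preserved.
energy = sum(c * c for node in nodes for c in node)
```

With a decomposition level of five, as used later in the paper,
the same routine would return 32 terminal nodes, one per sub-band.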


B. Self-similarity 


Although different definitions of second-order self- 
similarity can be found in the literature, they share the 
common idea of processes which do not change their 
qualitative statistical behavior after aggregation. The 
Hurst exponent (H) characterizes the level of
self-similarity, providing information on the recurrence
rate of similar patterns in time at different scales.
Several methods are available to estimate the Hurst
parameter. In this paper, wavelet based scaling exponent
estimation has been employed to calculate the Hurst
exponent. Let Y(t) be a continuous-time second-order
process with spectral density $\Gamma_Y(\nu)$,
$\nu \in \mathbb{R}$. It can be shown [9] that the second
moments of the details $d_Y(j,k)$ satisfy

$$ E\,[d_Y(j,k)^2] = \int \Gamma_Y(\nu)\, 2^{j}\, |\Psi_0(2^{j}\nu)|^2 \, d\nu \qquad (7) $$

where $\Psi_0(\nu) = \int \psi_0(t)\, e^{-i 2\pi \nu t}\, dt$
is the Fourier transform of $\psi_0$. These second-order
quantities take a particularly simple form in the case of
Long-Range Dependence (LRD), where by definition the
spectral density follows a power law near the origin:

$$ \Gamma_Y(\nu) \sim c_f\, |\nu|^{-\alpha}, \quad |\nu| \to 0, \quad \alpha \in [0,1), \quad c_f > 0 \qquad (8) $$

Because of the inherent scaling properties of the wavelet
basis $\{\psi_{j,k}(t) = 2^{-j/2}\, \psi_0(2^{-j}t - k),\ j,k \in \mathbb{Z}\}$,
which are naturally matched to the scaling properties of
LRD processes, one obtains:

$$ E\,[d_Y(j,k)^2] \sim 2^{j\alpha}\, c_f\, C(\alpha, \psi_0), \quad j \to \infty \qquad (9) $$

where $C(\alpha, \psi_0)$ is an integral independent of the
scale j. The scaling parameter α can therefore be
estimated by measuring the slope in a log-log plot of an
estimate of the left-hand side of (9) against j. Because
of the ability of wavelets to quasi-decorrelate scaling
processes, such an estimate has excellent statistical
properties. Moreover, to obtain the Hurst parameter (H),
the slope (α) can be transformed as H = (α+1)/2. Further
review of these methods can be found in the literature [9].
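
The slope-based estimate can be sketched in a few lines of
Python. This is an illustrative sketch, not the estimator used
in the study: it assumes an orthonormal Haar wavelet, an
unweighted least-squares fit of log2 of the mean squared detail
coefficient against the octave j, and hypothetical function
names. White Gaussian noise is used as a sanity check, since its
flat spectrum corresponds to α = 0 and hence H = 0.5.

```python
import math
import random


def haar_detail_energies(signal, levels):
    """Mean squared Haar detail coefficient for octaves j = 1..levels."""
    approx = list(signal)
    energies = []
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx) - 1, 2)]
        details = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        energies.append(sum(d * d for d in details) / len(details))
    return energies


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def hurst_from_signal(signal, levels):
    """Return (alpha, H) with H = (alpha + 1) / 2."""
    energies = haar_detail_energies(signal, levels)
    octaves = list(range(1, levels + 1))
    alpha = fit_slope(octaves, [math.log2(e) for e in energies])
    return alpha, (alpha + 1) / 2


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
alpha, hurst = hurst_from_signal(noise, levels=5)
```

In the procedure described in this paper, such an estimate would
be computed for the reconstructed signal of every wavelet packet
node rather than for the raw recording.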
C. Davies-Bouldin index 


The Davies-Bouldin (DB) criterion has been proven to 
be effective in many biomedical applications when used 
to evaluate the classification ability of feature space [11]. 
The DB index (DBI), or cluster separation index (CSI) is 
based on the scatter matrices of the data and is usually 
used to estimate class separability. It requires the 
computation of cluster-to-cluster similarity: 


$$ R_{ij} = \frac{D_{ii} + D_{jj}}{D_{ij}} $$

where $D_{ii}$ and $D_{jj}$ are the dispersions of the ith
and jth clusters, respectively, and $D_{ij}$ is the
distance between their mean values. $D_{ii}$ and $D_{ij}$
are given by:

$$ D_{ii} = \left[ \frac{1}{N_i} \sum_{n=1}^{N_i} \| y_n - m_i \|^2 \right]^{1/2}, \quad y_n \in \text{cluster } i \qquad (10) $$

and

$$ D_{ij} = \| m_i - m_j \| \qquad (11) $$

where $N_i$ is the number of members in cluster i, $y_n$ is
the nth sample vector of cluster i, and $m_i$ is the mean
vector of cluster i. The DBI is obtained by determining
the worst case of separation for each cluster and
averaging these values as follows:

$$ DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} R_{ij} \qquad (12) $$

where K is the total number of clusters. Herein, K is two
and the DBI assesses the separability of the normal and
disordered voice clusters. Lower values of the DB index
indicate a higher degree of cluster separability.
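
A minimal pure-Python sketch of the index may help make the
definition concrete. It assumes Euclidean distances and
root-mean-square dispersions as in (10) and (11); the function
names and the toy one-dimensional clusters are hypothetical and
are not data from the study.

```python
import math


def mean_vector(cluster):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(cluster)
    return [sum(v[d] for v in cluster) / n for d in range(len(cluster[0]))]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dispersion(cluster, mean):
    """Root-mean-square distance of the cluster members to their mean."""
    return math.sqrt(sum(distance(v, mean) ** 2 for v in cluster) / len(cluster))


def davies_bouldin(clusters):
    """Average over clusters of the worst-case similarity R_ij."""
    means = [mean_vector(c) for c in clusters]
    disps = [dispersion(c, m) for c, m in zip(clusters, means)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (disps[i] + disps[j]) / distance(means[i], means[j])
            for j in range(k)
            if j != i
        )
    return total / k


normal = [[0.0], [2.0]]
db_far = davies_bouldin([normal, [[10.0], [12.0]]])   # well separated
db_near = davies_bouldin([normal, [[3.0], [5.0]]])    # closer together
```

Applied per wavelet packet node, with the self-similarity values
of the normal and pathological groups as the two clusters, the
nodes with the smallest index would be the most discriminant ones.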

Fig. 2. DB values of first eight discriminant nodes.


D. Database


Used in this study are sustained vowel phonation 
samples from subjects from the Kay Elemetrics 
Disordered Voice Database [12]. It includes signals from 
53 normal voices, 54 unilateral vocal fold paralysis, 20 
vocal fold polyp, and 20 vocal fold nodules. This 
represents a wide variety of organic and neurogenic voice 
disorders. Subjects were asked to sustain the vowel /a/ 
and voice recordings were made in a sound proof booth 
on a DAT recorder at a sampling frequency of 44.1 kHz. 


III. RESULTS AND DISCUSSION 


The tenth-order Daubechies mother wavelet (db10) and a
decomposition level of five have been chosen for the
wavelet packet decomposition. The self-similarity of the
reconstructed signal in each sub-band of the wavelet
packet decomposition has then been measured, and the most
discriminant nodes have been selected according to the DB
criterion. Fig. 2 shows the DB values of the discriminant
nodes. As an illustration, the self-similarity of the
signals in node (0), the most discriminant node, is shown
in Fig. 3. The wavelet packet tree and the nodes
participating in the feature vector are shown in Fig. 4.
With the feature vectors for all normal and pathological
voices at hand, a principal component transformation has
been performed on the previously normalized feature
vectors. PCA led to an optimum dimension of six for the
input feature vector of the artificial neural network
(ANN). Then, 70 percent of the data has been used for
training the ANN and the remaining 30 percent as
validation and test data. Finally, a feedforward
backpropagation multilayer classifier with three hidden
layers has been trained to classify the voice signals.
The classification accuracy of 98 percent on the test and
validation samples shows that self-similarity based
wavelet packet feature extraction is reliable for normal
and pathological voice signal analysis. Furthermore, the
neural network classifier is an effective tool for voice
signal analysis.

Fig. 3. The discrimination ability of the node (0)

Fig. 4. The most discriminant nodes in terms of self-similarity.


V. CONCLUSION 


In this study, an efficient self-similarity based wavelet
packet feature extraction method and a Davies-Bouldin
criterion based technique for optimal feature vector
selection have been used to classify normal voices and
pathological voices of patients suffering from unilateral
vocal fold paralysis by means of an artificial neural
network. The classification accuracy of 98 percent shows
that the proposed nonlinear parameter of self-similarity
is a discriminant feature when calculated for the
reconstructed signals of selected sub-bands. Furthermore,
the Davies-Bouldin criterion is an effective feature
vector selection tool because of its low computational
cost. Also, the use of wavelet packets for feature
generation plays an important role in selecting a
discriminant feature vector from the whole frequency
range, taking advantage of their multi-resolution
properties.


Abstract


We describe a new sinusoidal model based engine for the
FESTIVAL TTS system. The engine performs the DSP (Digital
Signal Processing) operations of a diphone-based
concatenative TTS system, i.e. it converts a phonetic
input into an audio signal, taking as input the NLP
(Natural Language Processing) data computed by FESTIVAL
(a sequence of phonemes with length and intonation values
elaborated from the text script).

The engine aims to be an alternative to MBROLA and 
makes use of SMS (“Spectral Modeling Synthesis”) repre- 
sentation, implemented with the CLAM (C++ Library for 
Audio and Music) framework. 

This program will be released under an open source license
(GPL), and will compile wherever gcc and CLAM do
(i.e. Windows, Linux and Mac OS X operating systems).

Index Terms: TTS, SMS, MBROLA, FESTIVAL, GPL.


1. Introduction 


The whole DSP speech synthesis process is based upon the
SMS model and consists of three logical steps: analysis of
the concatenative unit database, diphone transformations
plus concatenation, and synthesis.

In this section we provide a brief history of analy- 
sis/synthesis models for speech synthesis and a short in- 
troduction of the sinusoidal plus residual model. In section 
2 we will describe the SMS analysis and synthesis steps. 
In section 3 we will focus on our custom diphone transfor- 
mation and concatenation algorithms. 


1.1. Preamble: a brief history 


Analysis/synthesis models for speech signal processing
appeared in the mid-thirties, when the VODER was created
by Homer Dudley, inspired by the VOCODER; later, in the
sixties, Flanagan invented his Phase Vocoder (PV). In the
mid-eighties Julius Smith developed the program PARSHL for
the purpose of supporting inharmonic and pitch-changing
sounds. This approach is better suited for the analysis
of inharmonic and pseudo-harmonic sounds. At the same
time, independently, Quatieri and McAulay developed a


Figure 1: SMS Analysis scheme flow-chart.


similar technique for analyzing speech. In the late
nineties Yannis Stylianou worked on the Harmonic plus
Noise Model (HNM) for concatenative TTS synthesis systems.
Harmonic and modulated noise components are separated in
the frequency domain by a time-varying parameter, referred
to as the maximum voiced frequency, Fm.


1.2. Sinusoidal plus residual model 


We briefly introduce the sinusoidal plus residual
representation, which separates the input audio signal
into a sum of partials plus an inharmonic, noise-like
part, called the residual:

$$ s(t) = \sum_{r=1}^{R} A_r \cos[2\pi f_r t + \theta_r] + e(t) \qquad (1) $$

where s(t) is the audio signal, e(t) the residual
component, and $A_r$, $f_r$ and $\theta_r$ are respectively
the amplitude, frequency and phase of the r-th sinusoid.
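
The model can be illustrated with a short Python sketch. It is
only a toy example under stated assumptions: the partials are
stationary (fixed amplitude, frequency and phase), time is
sampled as t = n / sample_rate, and the function names are
hypothetical rather than part of the engine described here.

```python
import math


def deterministic_part(partials, n_samples, sample_rate):
    """Sum of sinusoids: sum_r A_r * cos(2*pi*f_r*t + theta_r)."""
    out = []
    for n in range(n_samples):
        t = n / sample_rate
        out.append(sum(a * math.cos(2 * math.pi * f * t + th) for a, f, th in partials))
    return out


def synthesize(partials, residual, sample_rate):
    """s[n] = deterministic part + residual e[n]."""
    sines = deterministic_part(partials, len(residual), sample_rate)
    return [s + e for s, e in zip(sines, residual)]


def extract_residual(signal, partials, sample_rate):
    """e[n] = s[n] - deterministic part (time-domain subtraction)."""
    sines = deterministic_part(partials, len(signal), sample_rate)
    return [s - d for s, d in zip(signal, sines)]


# (amplitude, frequency in Hz, phase in radians) for each partial
partials = [(1.0, 100.0, 0.0), (0.5, 200.0, math.pi / 2)]
residual = [0.1] * 8
signal = synthesize(partials, residual, sample_rate=8000)
recovered = extract_residual(signal, partials, sample_rate=8000)
```

The subtraction only recovers the residual because the phases of
the re-generated sinusoids match those of the signal, which is
the point made about the stochastic analysis in section 2.1.4.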


2. SMS - Analysis and Synthesis 


SMS [1] is a set of techniques for processing audio signals 
which implements the sinusoidal plus residual model. The 
task of SMS analysis is to extract some spectral parameters 
from the time-domain signal. From this kind of data we 
can obtain a time domain signal through the SMS synthesis 
step. 


2.1. SMS Analysis 


Fig. 1 shows the block diagram of the whole SMS analysis.
It is based upon the Short Time Fourier Transform (STFT):
the signal is cut into consecutive overlapping frames,
which are multiplied by an analysis window, and an FFT is
computed for each of these chunks. We obtain a spectrum
from which we detect the components present in the
original sound. The harmonic analysis consists of peak
detection, pitch detection and spectral peak continuation.
When the harmonic analysis is completed, the residual
component can be computed, and thus the whole SMS
analysis step is achieved.
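
The framing, windowing and transform steps can be sketched as
follows. This is a simplified illustration and not the CLAM
implementation: a naive DFT stands in for the FFT, a Hann window
is assumed as the analysis window, and the frame size, hop size
and function names are arbitrary choices made for the example.

```python
import cmath
import math


def hann(n_points):
    """Periodic Hann window, assumed here as the analysis window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points) for n in range(n_points)]


def dft_magnitudes(frame):
    """Magnitude spectrum (bins 0..N/2) via a naive DFT standing in for the FFT."""
    n_points = len(frame)
    mags = []
    for k in range(n_points // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * n / n_points)
            for n, x in enumerate(frame)
        )
        mags.append(abs(acc))
    return mags


def stft_magnitudes(signal, frame_size, hop):
    """Cut the signal into overlapping frames, window them, transform each."""
    window = hann(frame_size)
    spectra = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = signal[start:start + frame_size]
        spectra.append(dft_magnitudes([w * x for w, x in zip(window, chunk)]))
    return spectra


# A cosine with exactly 4 cycles per 32-sample frame peaks in bin 4.
tone = [math.cos(2 * math.pi * 4 * n / 32) for n in range(128)]
spectra = stft_magnitudes(tone, frame_size=32, hop=16)
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spectra]
```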


2.1.1. Peak detection 


In audio processing, time-varying sinusoids are called
partials, and each of them is the result of a main mode of
vibration of the generating system. A partial in the
frequency domain can be identified by its spectral shape
(magnitude and phase), its relation to other partials, and
its time evolution. So the first step of SMS analysis is
the detection of partials, which are searched for among
the prominent magnitude peaks of the current frame
spectrum.

Most natural sounds are not perfectly periodic and do not
have nicely spaced and clearly defined peaks in the
frequency domain. A practical solution is to detect as
many peaks as possible and delay the decision of what is
a deterministic, or “well behaved”, partial to the next
step in the analysis: the peak continuation algorithm.
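
A permissive peak picker in this spirit can be sketched as
follows. It is a minimal sketch under simple assumptions: a peak
is any local maximum of the magnitude spectrum above a threshold,
no interpolation between bins is attempted, and the function name
and threshold value are hypothetical.

```python
def detect_peaks(magnitudes, threshold=0.0):
    """Return (bin, magnitude) for every local maximum above the threshold."""
    peaks = []
    for k in range(1, len(magnitudes) - 1):
        m = magnitudes[k]
        if m > threshold and m > magnitudes[k - 1] and m >= magnitudes[k + 1]:
            peaks.append((k, m))
    return peaks


spectrum = [0.1, 0.3, 2.0, 0.4, 0.2, 1.1, 0.9, 0.05]
peaks = detect_peaks(spectrum, threshold=0.25)
```

Keeping the threshold low passes as many candidates as possible
to the later stages, which then decide which peaks belong to
stable partials.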


2.1.2. Pitch detection 


Before continuing a set of peak trajectories through the 
current frame it is useful to search for a possible fundamen- 
tal frequency. If it exists, we will have more information 
to work with, and it will simplify and improve the tracking 
of partials. 

The fundamental frequency can be defined as the common
divisor of the harmonic series that best explains the
spectral peaks. It is possible that the common divisor
does not belong to the set of detected peaks. For this
reason the fundamental frequency is better called pitch
(i.e. that particular frequency heard to be the main
frequency of a sound). The algorithm that chooses the
fundamental frequency can be described in three main
steps:

1. Choose possible fundamental candidates.

2. Measure the “goodness” of the resulting harmonic
series compared with the spectral peaks.

3. Get the best candidate.
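
The three steps can be sketched in Python as below. The scoring
is an assumption made for illustration, loosely in the spirit of
a two-way mismatch measure, and is not claimed to be the
procedure used by SMS: candidates are the peak frequencies and
the gaps between adjacent peaks, and the error adds the distance
from each peak to its nearest harmonic and from each predicted
harmonic to its nearest peak. All names are hypothetical.

```python
def harmonic_mismatch(f0, peaks):
    """Lower is better: how badly the harmonic series of f0 explains the peaks."""
    # Step 2a: distance from every peak to the nearest harmonic of f0.
    peak_to_harm = 0.0
    for f in peaks:
        h = max(1, round(f / f0))
        peak_to_harm += abs(f - h * f0) / f0
    peak_to_harm /= len(peaks)
    # Step 2b: distance from every predicted harmonic to the nearest peak.
    n_harm = max(1, int(max(peaks) // f0))
    harm_to_peak = 0.0
    for h in range(1, n_harm + 1):
        harm_to_peak += min(abs(h * f0 - f) for f in peaks) / f0
    harm_to_peak /= n_harm
    return peak_to_harm + harm_to_peak


def estimate_pitch(peaks):
    """Step 1: build candidates; step 3: keep the one with the smallest error."""
    ordered = sorted(peaks)
    candidates = set(ordered)
    candidates.update(b - a for a, b in zip(ordered, ordered[1:]) if b > a)
    return min(sorted(candidates), key=lambda f0: harmonic_mismatch(f0, ordered))


# The fundamental (100 Hz) is missing from the detected peaks.
pitch = estimate_pitch([200.0, 300.0, 400.0, 500.0])
```

In this example the selected candidate is the 100 Hz spacing
between the peaks, even though no peak was detected at 100 Hz.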


2.1.3. Peak continuation 


From peak detection we obtain some “badly behaved”
partials that should not be considered. The basic idea of
the algorithm is that a set of “guides” advances in time
through the spectral peaks, looking for the appropriate
ones (according to the specified constraints) and forming
trajectories out of them. The instantaneous state of the
guides, their frequency and magnitude, is continuously
updated as the guides are turned on, advanced, and finally
turned off. When a fundamental has been found in the
current frame, the guides can use this information to
update their values. The peak continuation algorithm
completes the harmonic analysis.

Figure 2: SMS Synthesis scheme flow-chart.


2.1.4. Stochastic Analysis 


Referring to formula (1), the stochastic component e(t) of
the current frame is calculated by first re-generating the
deterministic signal with additive synthesis, and then
subtracting it from the original waveform s(t) in the time
domain. This is possible because the phases of the
original sound are matched, so the shape of the original
waveform is preserved. The stochastic representation is
then obtained by performing a spectral fitting of the
residual signal.


2.2. SMS Synthesis 


The SMS synthesis process is described in Fig. 2 where we 
can see the two inputs that come from the analysis (possi- 
bly transformed) representing the deterministic (harmonic) 
component and the residual one as described in section 2.1. 

The whole signal is processed in the frequency domain,
where the two components are treated independently; we
then return to the time domain by performing an inverse
FFT. For the deterministic component the goal is to obtain
the spectrum of a sum of sinusoids. The stochastic signal
is obtained by filtering white noise with the residual
spectral envelope. We can then use a single iFFT for the
combined spectrum. Finally, in the time domain we impose
the triangular window in the overlap-add process,
combining successive frames to get the time-varying
characteristics of the sound.
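
The final overlap-add step can be illustrated with a small
Python sketch. It assumes a hop of half the frame size, for
which shifted triangular windows sum to one, and uses constant
frames in place of real iFFT output; the function names are
hypothetical.

```python
def triangular_window(size):
    """Triangle rising from 0 to 1 at the frame centre and back down."""
    half = size / 2
    return [1.0 - abs(n - half) / half for n in range(size)]


def overlap_add(frames, hop):
    """Window each time-domain frame and sum it into the output at i * hop."""
    size = len(frames[0])
    window = triangular_window(size)
    out = [0.0] * (hop * (len(frames) - 1) + size)
    for i, frame in enumerate(frames):
        for n, x in enumerate(frame):
            out[i * hop + n] += window[n] * x
    return out


# Three constant frames stand in for the output of successive iFFTs.
frames = [[1.0] * 8 for _ in range(3)]
signal = overlap_add(frames, hop=4)
```

Away from the first and last half frame, the overlapping windows
add up to one, so a stationary input is reproduced without
amplitude modulation while changing frames are cross-faded.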


3. Speech Synthesis Architecture 


The SMS engine performs the DSP operations of a
text-to-speech system, taking as input a phonetic file
computed by FESTIVAL [2], which describes the
pronunciation of the text script through a sequence of
phonemes with length in ms and intonation values in Hz.

Figure 3: Scheme of speech synthesis architecture operations.


Our program aims to be an MBROLA [3] alternative 
and thus it makes use of the same phonetic input format 
and command line parameters. Both programs implement 
concatenative synthesis using diphones (that must be ex- 
ternally supplied) as base units. 

The three logical steps upon which the whole DSP 
speech synthesis process is based are: 


1. analysis: it has been theoretically described in
section 2.1 and consists of converting the time domain
diphone database into a database of spectral parameters
stored in Sound Description Interchange Format (SDIF);


2. transformations plus concatenation: it will be de- 
scribed further in this section; 


3. synthesis: it has been described in section 2.2. 


Fig. 3 shows a simplified block diagram of the
transformations plus concatenation and the synthesis steps.


3.1. SMS Transformations 


The task of the transformation step is to adapt every
required diphone to the parameters specified (for each
phoneme) by the phonetic input. For now those parameters
describe only the duration of the phoneme and its pitch
evolution.

So the main spectral transformations needed are time 
stretching and pitch shifting. Both these operations mod- 
ify magnitude and frequency of signal partials. These new 
values become incoherent with original phases. This prob- 
lem affects the re-synthesized signal, deteriorating its au- 
dio quality. 

Thus it is necessary to re-calculate the sinusoids’ 
phases; two ways for doing this are available: the phase 
continuation and the relative phase delay algorithms. 


3.1.1. Time Stretching 


The time stretching process is based upon linear
interpolation and decimation of frequencies and
magnitudes. The algorithm preserves transition integrity
and intelligibility between different phonemes in every
diphone, as far as the input time requirements are
satisfied.
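
The interpolation and decimation of frame parameters can be
sketched as follows. This is an illustrative sketch that treats
one parameter track (for example the frequency of one partial
across frames) as a list and resamples it uniformly; the actual
algorithm additionally protects phoneme transitions, which is
not modelled here, and the function name is hypothetical.

```python
def stretch_track(values, new_length):
    """Resample one parameter track to new_length frames by linear interpolation."""
    if new_length == 1 or len(values) == 1:
        return [values[0]] * new_length
    step = (len(values) - 1) / (new_length - 1)
    out = []
    for i in range(new_length):
        pos = i * step
        lo = min(int(pos), len(values) - 2)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


freqs = [100.0, 110.0, 120.0]      # one partial over three frames
longer = stretch_track(freqs, 5)   # interpolation (slower speech)
shorter = stretch_track(freqs, 2)  # decimation (faster speech)
```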


3.1.2. Pitch Shifting 


Pitch shifting is the transformation that takes care of
modulating the intonation of the sentence to be uttered.
It is performed after time stretching to better fit the
intonation requirements.

The pitch shifting routine is implemented using a
formant-preserving algorithm that tries to maintain the
original timbre of the sound. The magnitude of each
transformed partial is placed upon the original spectrum
envelope, at the position corresponding to the original
frequency scaled by a common factor.
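
The idea of reading the new magnitudes off the original envelope
can be sketched as follows. The sketch makes several
assumptions that are not taken from the engine: the envelope is
a piecewise-linear curve through the original partials, it is
held constant outside their frequency range, magnitudes are on a
linear scale, and the partials are sorted by frequency. The
function names are hypothetical.

```python
def envelope_at(freq, partials):
    """Piecewise-linear spectral envelope through (frequency, magnitude) points."""
    if freq <= partials[0][0]:
        return partials[0][1]
    for (f0, m0), (f1, m1) in zip(partials, partials[1:]):
        if freq <= f1:
            return m0 + (m1 - m0) * (freq - f0) / (f1 - f0)
    return partials[-1][1]


def pitch_shift(partials, factor):
    """Scale every frequency by factor; take magnitudes from the old envelope."""
    return [(f * factor, envelope_at(f * factor, partials)) for f, _ in partials]


frame = [(100.0, 1.0), (200.0, 0.5), (300.0, 0.25)]
shifted = pitch_shift(frame, 1.5)
```

Because the magnitudes follow the envelope instead of moving
with the partials, the formant structure stays in place while
the pitch changes.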


3.1.3. Phase Continuation 


The simplest way to reconstruct the phases is called phase
continuation. It arbitrarily sets the phase of every
partial of the first frame and then computes the values
frame by frame. In this way the algorithm discards all the
original analyzed phase data. The formula used to
propagate phase values is the following:

$$ \theta_i^k = \theta_{i-1}^k + \pi H \left( f_{i-1}^k + f_i^k \right) \qquad (2) $$

where $f_i^k$ and $\theta_i^k$ are respectively the
frequency and phase of the k-th partial of the i-th frame,
and H is the hop size.
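
A short sketch of this propagation for a single partial follows.
It reads equation (2) as: new phase = previous phase +
pi * H * (previous frequency + current frequency), which amounts
to advancing the phase by the average frequency over one hop.
The units are an assumption, since the text does not state them:
frequencies are taken in Hz and the hop size in seconds. The
function name is hypothetical.

```python
import math


def continue_phases(freq_track, hop_seconds, initial_phase=0.0):
    """Propagate the phase of one partial from frame to frame.

    Assumes frequencies in Hz and the hop size H in seconds.
    """
    phases = [initial_phase]
    for prev, cur in zip(freq_track, freq_track[1:]):
        phases.append(phases[-1] + math.pi * hop_seconds * (prev + cur))
    return phases


# One partial over three frames, with a 10 ms hop.
phases = continue_phases([100.0, 100.0, 120.0], hop_seconds=0.01)
```

With a constant 100 Hz partial and a 10 ms hop the phase advances
by exactly one cycle per frame, as expected.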


3.1.4. Relative phase delay representation 


A more complex method, theorized in [4], that helps to
preserve the original waveform is based on the relative
phase delay representation of the phase, defined as the
difference between the phase delay (phase/radian frequency
ratio) of the partials and the phase delay of the
fundamental. This makes the waveform characterization
independent of the phase of the first partial.

Once the relative phase delays are computed for each
frame, it is possible to propagate the phase of the
modified fundamental, as described in equation (2), and
rebuild the waveform by adding the relative phase delays
to the new fundamental phase delay.
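
The representation and its inversion can be sketched for a
single frame as follows. This is a minimal sketch with
hypothetical function names; it assumes the first partial is the
fundamental, uses frequencies in Hz (so the radian frequency is
2*pi*f), and ignores phase wrapping.

```python
import math


def relative_phase_delays(freqs, phases):
    """Phase delay of each partial minus the phase delay of the fundamental."""
    delays = [p / (2 * math.pi * f) for f, p in zip(freqs, phases)]
    return [d - delays[0] for d in delays]


def rebuild_phases(freqs, rel_delays, fundamental_phase):
    """Add the relative delays to the new fundamental delay, convert to phase."""
    base = fundamental_phase / (2 * math.pi * freqs[0])
    return [2 * math.pi * f * (base + r) for f, r in zip(freqs, rel_delays)]


freqs = [100.0, 200.0, 300.0]   # fundamental first
phases = [0.3, 1.1, 2.0]
rel = relative_phase_delays(freqs, phases)
restored = rebuild_phases(freqs, rel, fundamental_phase=0.3)
```

Rebuilding with the unchanged fundamental phase returns the
original phases; after a transformation, the propagated phase of
the modified fundamental would be passed in instead.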


3.1.5. SMS Diphones Concatenation 


Once transformed, two successive diphones have to be
concatenated. This operation is mainly based on the time
stretching interpolation and decimation subroutine,
although pitch shifting is also used. Basically, this
operation morphs the last frames of a diphone with the
first frames of the following one. The pitch of those
frames is matched and then magnitude and frequency are
interpolated. This ensures smoothing between the frames of
the phoneme belonging to consecutive diphones.

Figure 4: SMS-MBROLA Synthesis comparison.


3.2. The CLAM framework 


The CLAM (C++ Library for Audio and Music) framework [5]
aims to offer extensible, generic and efficient design and
implementation solutions for developing audio and music
applications, and it is well suited for implementing the
SMS model.

It greatly simplified our task, since it is quite complete
and includes all the utilities needed in a sound
processing project (input/output processing, storage,
display...). Moreover, its good design allows easy
adaptation to any kind of need.

The project is released under GPL version 2 (or later). 
It is Platform Independent (compiles under GNU/Linux, 
Windows and Mac platforms) and thus it is quite simple to 
create portable applications. 


4. SMS-MBROLA comparison 


Even if the system is still at a preliminary stage,
some comparisons with the state-of-the-art MBROLA
concatenative speech synthesis are already available at
“http://www.pd.istc.cnr.it/FESTIVAL/home/SMS.htm”.

We have synthesized some samples from the same phonetic
files with our engine and MBROLA. We have also used
diphone databases obtained from the same audio recordings
(a collection of 1299 diphones of the Italian language).

However, it must be observed that the SMS 
analyzed database is about 10 times larger than the original 
time-domain one (~70 MB against 6.5 MB). The MBROLA 
synthesis engine is also faster than ours; however, performance 
was outside the scope of this work. 

Both MBROLA and our engine synthesize clearly intelligible 
phrases, and our concatenation routine works quite 
well even in cases in which several phonemes are 
spoken rapidly. However, the audio quality of MBROLA synthesis 
is often cleaner than ours, which is sometimes affected 
by some hoarseness. In Fig. 4 the sentence “il colombre” 
has been synthesized by SMS (on the left) and MBROLA 
(on the right). 


5. Conclusions 


A new sinusoidal-model-based engine for concatenative 
TTS systems has been presented. This engine has been 
shown to be comparable, in terms of audio quality and 
intelligibility, with similar state-of-the-art systems. 

We are evaluating some strategies to improve 
the engine. The most important ones are: 


• an analysis verification process, to correct 
some artifacts that may occur in the analysis step; 


• SMS pitch-synchronous operations; 


• an alternative implementation of pitch shifting and time 
stretching, to obtain better audio quality; 


• support for voice quality parameters, such as spectral 
tilt, in order to perform emotional TTS synthesis. 


Abstract: Unlike many acoustic measures, Cepstral 
Peak Prominence (CPP) has shown consistently high 
correlations with subjective vocal quality ratings. 
However, this superiority of the CPP index is reported 
on the basis of empirical results, with its theoretical 
advantages not always clearly stated. In this paper the 
properties of the CPP which make it a good predictor 
of vocal quality are addressed, as well as how it 
differs from other measures. The reported 
experimental setups of the previous studies are 
analyzed, and reasons for the observed variability in 
the results are given. After this discussion, the clinical 
usefulness of CPP is addressed. This paper can be 
useful for researchers as well as for clinicians, in 
planning the experimental setup and interpreting the 
relevance of the results. 

Keywords: Cepstrum, vocal quality, breathiness, 
roughness 


I. INTRODUCTION 


Many acoustic measures have been proposed to 
correlate with overall vocal quality or one of its 
dimensions (i.e. breathiness, roughness, strain, 
hoarseness, etc.; an extensive tabulation of methods can 
be found in [1]). In spite of the large number of measures 
available, there is a lack of consistent results across 
different studies for most of the measures (e.g. jitter, 
shimmer, HNR) [2]. Recent work [3][4][5] has shown the 
Cepstral Peak Prominence (CPP) [6] or its smoothed 
version (CPPs) [7] to correlate highly with vocal quality 
dimensions and overall grade. The high correlation in 
these latter studies has been consistent and notably 
superior to the other acoustic measures considered. 

Several points, though, should be stated more 
clearly. Most of these works provide experimental 
data in which CPP performs better on empirical 
evidence alone. The theoretical advantages of CPP over the rest 
of the acoustic measures are not always clearly stated. 
Some of the experimental setups also favor abnormally 
high values for the amount of variance explained. 

In this paper we address the properties of the CPP that 
make it a good predictor of vocal quality, and how it 
differs from other measures. In addition, the reported 
experimental setups of the previous studies are analyzed, 
and reasons for the observed variability in the results are 
given. After this discussion, the clinical usefulness of 
CPP is addressed. This paper can be useful for clinicians 
wishing to interpret the results of the acoustic measures, as 
well as for researchers, in planning the experimental 
setup and explaining the relevance of the results. 


II. CPP PROPERTIES 


The origin of the CPP measure in [6] is, like its 
companion measure RPK for the autocorrelation peak, a 
basic pitch detector. Both measures were devised to 
appraise the prominence of the peak that should occur at 
the pitch value in the cepstrum and autocorrelation 
functions, respectively. As such, CPP is sometimes 
erroneously believed to be a measure of signal 
periodicity when, in fact, it only measures the periodicity 
of the signal spectrum. It is precisely this subtle 
difference (measuring spectral harmonic periodicity 
instead of strict periodicity) that makes it particularly 
well suited to vocal quality measurement, and superior to many other 
measures. 
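
As an illustration of how a cepstral peak prominence value can be obtained, the following sketch follows the general recipe of [6]: log spectrum, cepstrum, regression line, and the distance from the cepstral peak to that line. The periodic Hamming window, the 60 dB dynamic-range floor, the pitch search range, the regression range and all names are assumptions made for this example, and a naive DFT is used only to keep the code free of external dependencies.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (dependency-free on purpose)."""
    n = len(x)
    tw = [cmath.exp(-2j * math.pi * m / n) for m in range(n)]
    return [sum(x[t] * tw[(k * t) % n] for t in range(n)) for k in range(n)]


def cpp_db(frame, fs, f0_min=60.0, f0_max=300.0, floor_db=60.0):
    """Return (CPP in dB, pitch estimate in Hz) for one frame of samples."""
    n = len(frame)
    q_lo, q_hi = int(fs / f0_max), int(fs / f0_min)
    if q_hi >= n // 2:
        raise ValueError("frame too short for the requested pitch range")
    # Periodic Hamming window (assumed; the text does not prescribe one).
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(frame)]
    mag = [abs(v) for v in dft(windowed)]
    peak_mag = max(mag)
    if peak_mag == 0.0:
        raise ValueError("silent frame")
    # Limit the dynamic range so that empty bins do not dominate the log spectrum.
    floor = peak_mag * 10 ** (-floor_db / 20)
    log_spec = [20 * math.log10(max(m, floor)) for m in mag]
    # Cepstrum: magnitude (in dB) of the transform of the log spectrum.
    ceps = [20 * math.log10(abs(v) + 1e-12) for v in dft(log_spec)]
    qs = list(range(q_lo, q_hi + 1))
    ys = [ceps[q] for q in qs]
    q_peak = max(qs, key=lambda q: ceps[q])
    # Least-squares regression line through the cepstrum in the search range.
    mq, my = sum(qs) / len(qs), sum(ys) / len(ys)
    slope = (sum((q - mq) * (y - my) for q, y in zip(qs, ys))
             / sum((q - mq) ** 2 for q in qs))
    baseline = my + slope * (q_peak - mq)
    return ceps[q_peak] - baseline, fs / q_peak


# Two perfectly periodic 125 Hz test signals sampled at 8 kHz.
fs, n, period = 8000, 512, 64
pulses = [1.0 if i % period == 0 else 0.0 for i in range(n)]
sine = [math.sin(2 * math.pi * i / period) for i in range(n)]
cpp_pulses, f0_pulses = cpp_db(pulses, fs)
cpp_sine, _ = cpp_db(sine, fs)
```

Both test signals are strictly periodic, yet only the pulse train, whose spectrum is a regular comb of harmonics, should yield a large prominence; the single-harmonic sinusoid should score far lower. The absolute values obtained on such noiseless synthetic signals are not comparable to values measured on real voices.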

What follows is a categorization of measures into five 
groups, according to the signal characteristics that have 
been most correlated with different vocal quality 
dimensions. The first two are the more common 
amplitude (shimmer) and frequency (jitter) perturbations, 
absent in the original studies on CPP [6][7], while the 
other three groups are the ones actually included in those 
studies. The sensitivity of CPP to each signal 
characteristic is discussed, as well as its possible 
advantages and drawbacks compared to other measures. 


1. Amplitude perturbations (shimmer). 

Signals with amplitude perturbations have frequently 
been related to roughness [3], and sometimes to 
breathiness. The traditional measures of shimmer are 
obtained in the time domain, relying on a Pitch Detection 
Algorithm (PDA). CPP is sensitive to shimmer, since 
shimmer affects the spectral harmonic structure [8]. CPP 
values diminish as shimmer increases, and can be more 
robust than time-domain techniques relying on a PDA. It 
has been shown that shimmer, jitter and time-domain 
Harmonics-to-Noise Ratios (HNR) are quite sensitive to 
even small errors in the pulse boundaries [9]. 


2. Frequency perturbations (jitter) 

Jitter is in a similar situation to shimmer, being 
mostly related to roughness, and less frequently to 
breathiness. Jitter affects spectral structure to a greater 
extent than shimmer [8], and CPP can therefore also be a 
good measure of this perturbation. The same advantage 
regarding the sensitivity of time-domain measures of 
jitter to errors in the PDA holds in this case in favor of 
CPP. 
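
To make the dependence of time-domain perturbation measures on the pulse boundaries concrete, the sketch below computes a "local" perturbation percentage (mean absolute difference between consecutive values, relative to their mean). This is one common definition among several and is an assumption here, as are the function names and the example pulse marks; applied to cycle durations it gives a jitter value, and applied to cycle peak amplitudes a shimmer value.

```python
def periods_from_marks(marks_ms):
    """Cycle durations (ms) from consecutive pulse boundary marks (ms)."""
    return [b - a for a, b in zip(marks_ms, marks_ms[1:])]


def local_perturbation_percent(values):
    """Mean absolute consecutive difference relative to the mean value, in percent."""
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# A perfectly regular 250 Hz voice: one pulse mark every 4 ms.
clean_marks = [0.0, 4.0, 8.0, 12.0, 16.0, 20.0]
# The same voice with a single mark misplaced by 0.1 ms (a hypothetical PDA error).
shifted_marks = [0.0, 4.0, 8.0, 12.1, 16.0, 20.0]

jitter_clean = local_perturbation_percent(periods_from_marks(clean_marks))
jitter_shifted = local_perturbation_percent(periods_from_marks(shifted_marks))
```

In this example a single boundary error of 0.1 ms raises the measured jitter from 0% to about 2.5%, although the underlying signal is unchanged. A cepstral measure does not rely on such marks.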


3. Additive Noise 

The presence of additive noise has been related mainly 
to breathiness. The prominence of the cepstral peak is 
also affected by increasing levels of noise, since noise 
reduces the dip between harmonics. In fact, several 
studies have focused on this property to develop HNR 
measures [10][11]. Compared with time-domain HNRs, CPP 
has the advantage of not requiring accurate PDAs; compared 
with many frequency-domain HNRs, it does not 
require the determination of the harmonic frequencies. 
Existing HNR measures have been regarded as overall 
disperiodicity measures, since they have been shown to be 
sensitive also to jitter and shimmer [10][8][11]. CPP 
shares this feature, being sensitive to these three groups. 


4. First Harmonic Amplitude 

A high amplitude of the first harmonic (with respect to 
the second harmonic [7] or to the first formant [12]) has 
been related to breathiness. The underlying assumption is 
that breathy voices do not produce abrupt glottal closures, 
producing an excitation which is more rounded, almost 
sinusoidal. First harmonic amplitude prominences are 
closely related to glottal flow measures like the amplitude 
quotient or the speed quotient [13]. Here the CPP is 
superior to its companion RPK in [6] and to other HNR 
measures. The CPP will produce no prominent peak for a 
perfect sinusoid, since a sinusoid consists of only one 
harmonic (no periodic spectral structure). That is the 
main difference from other periodicity measures: a 
perfectly periodic signal does not necessarily produce a high 
CPP. This lack of higher harmonics is also typical of 
nasal voices [1], extending the sensitivity of CPP to the 
nasality dimension. 


5. Spectral Tilt 

An increase in the energy content of the higher 
portion of the spectrum has been related to breathiness 
[14]. CPP is not able to measure spectral tilt changes, 
which would be reflected in the lower part of the cepstrum, 
discarded for its calculation. However, spectral tilt 
measures have been reported to have the smallest 
relevance in breathiness ratings in several other studies 
[6][7]. The inability of CPP to follow spectral tilt changes may 
therefore have a negligible effect on its prediction of breathiness. 

As seen, CPP can respond adequately to most 
of the signal characteristics which have been related to 
many vocal quality dimensions (breathiness, roughness, 
hoarseness, nasality). If an orthogonal representation of 
the GRBAS scale is accepted [15], the CPP can be 
expected to be a better predictor of overall Grade than of 
any individual dimension. This would occur because 
a selective response of CPP to one particular dimension is 
affected by its sensitivity to the others. 

The next section explains the results of the CPP index 
in several reported studies in terms of the previous 
discussion. 


III. REPORTED STUDIES 


The studies covered in this section are the original CPP 
and CPPs studies by Hillenbrand et al. [6] and Hillenbrand & 
Houde [7], and more recent studies by Heman-Ackah et 
al. [3], Awan & Roy [4] and Maryn et al. [5]. 


• Hillenbrand et al. (1994) [6] 

This study consisted of the voluntary control of three 
breathiness phonation levels by 15 normal subjects on 
four vowels. The number of judges was high (20) and the 
rating scale was an unrestricted visual-analog (VA) scale. The 
different acoustic indexes were calculated over three 
types of signals: the original, a band-pass filtered signal 
and a high-pass filtered version. CPP emerged as a very 
good predictor of breathiness ratings (Pearson’s r greater 
than 0.9, more than 80% of the variance explained by r²), 
with RPK in the band-pass signal showing similar results. 
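
As a reminder of how the two figures quoted above relate, the proportion of variance explained is the square of Pearson's correlation coefficient. This is a standard identity, not a result specific to [6]:

```latex
r \ge 0.9 \;\Rightarrow\; r^{2} \ge 0.9^{2} = 0.81,
\qquad\text{and conversely}\qquad
r^{2} = 0.50 \;\Rightarrow\; |r| = \sqrt{0.50} \approx 0.71 .
```

The second relation is useful when comparing studies that report correlations with studies that report explained variance: an explained variance of 50% still corresponds to a correlation of about 0.71.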

This study intentionally limited the perturbation to 
breathiness. This has, according to the discussion in 
Section II, a twofold consequence. First, breathiness 
ratings do coincide with “grade” since it is the only 
deviant dimension, and second, the obtained correlations 
can be high because CPP is not affected by interference 
with other distortions. The possible influence of using 
non-pathological speakers is addressed in the analysis of 
the next study. 


• Hillenbrand & Houde (1996) [7] 

Here a broad pathological database was screened to 
select 20 recordings presenting mainly breathiness, as 
well as 5 recordings from nonpathological subjects. 
Recordings included a sustained vowel as well as running 
speech. The number of judges was 20, and the scale used 
was unrestricted VA. CPP and CPPs were again the best 
predictors, with similar results (up to 85% and 92% of the 
variance explained in the running speech and sustained 
vowels, respectively). 

In this study no version of the RPK could match the 
performance of its equivalent cepstral measure (a best 
result of 72% of variance explained). A possible cause is 
that pathological speakers showed a stronger influence of 
first harmonic amplitudes and spectral tilt measures in 
breathiness ratings than in the previous study, and CPP is 
better suited than RPK to reflect at least the former factor. 
Again, the restriction of the deviant dimension to breathiness 
can explain the extremely high correlations obtained. 


• Heman-Ackah et al. (2002) [3] 

Voices from 19 patients were available, preoperative 
and postoperative, in both a sustained vowel and running 
speech. Two judges rated grade, breathiness and 
roughness on a 120 mm VA scale. 

The results are lower than the ones in [6] and [7], the 
cause being the absence of a selective screening of the 
deviant dimensions, a situation more likely to occur in 
clinical practice. Here the results for grade (65%-75% of 
the variance explained) are considerably better than the 
ones for breathiness (50%) or roughness (20%-25%). 
These results fully agree with the 
discussion in Section II. 


• Awan & Roy (2005) [4] 

Recordings from 83 dysphonic and 51 normal female 
subjects were rated by 12 judges as belonging to four 
groups or voice types: normal, breathy, rough and hoarse. 
The degree of the dimension was not the goal of the 
study, only the type. 

The study found CPP to be good at discriminating 
normal from dysphonic voices, but it was not relevant for 
the separation among the different dysphonic types. A 
logarithmic shimmer measure was found best suited for 
the latter purpose. This also agrees with our 
analysis in Section II: CPP is similarly sensitive to 
breathy and rough signal characteristics, and cannot be a 
reliable separator between them. 


• Maryn et al. (2007) [5] 

This study comprised recordings from both a sustained 
vowel and running speech, from 229 patients and 22 
normal subjects. Samples were rated by five judges in the 
G, R and B dimensions of the GRBAS scale. 

CPP ranked again the best among all acoustic measures 
considered, and again the correlation was strong with 
overall grade and breathiness ratings. Results are the 
lowest reported (50% of the variance explained for grade) 
but the size of the database is also the largest, thus 
including more variability than previous studies. 


IV. DISCUSSION AND CONCLUSION 


According to the previous sections, CPP can be 
expected to appraise overall grade better than any other 
acoustic measure of vocal quality previously reported. If 
proper screening of samples is performed (i.e. limit signal 
deviation to a single dimension) CPP can produce 
extremely high values of correlation with the individual 
dimensions. 


A significant reduction in the percent of variance 
explained occurs when considering signals with a wide 
range of variability, but even in that case, CPP can still 
perform as the best single predictor of overall vocal 
quality. Another point in favor of CPP is its similar 
performance on sustained vowels and running speech. 
The desirability of using running speech for acoustic 
measures has been pointed out in several studies [7][5], 
and only a small fraction of the existing measures can 
work on running speech. 

The usefulness of CPP is limited, though, when trying to 
separate the different dimensions of vocal quality. Its 
sensitivity to most of the relevant distortions found in 
pathological voice makes it better suited to predict grade 
than any individual dimension. Since the latter is usually 
what is needed in clinical practice, complementary acoustic 
measures are required to give an accurate and 
exhaustive description of vocal quality in terms of 
objective measures. 


Abstract: This study visualizes the glottal excitation in a 
temporally highly resolved estimate of the first formant. 
Instantaneous estimates during the glottal cycle of the 
frequency and bandwidth of the first formant closely 
follow the electroglottographic contour. This is 
demonstrated for phonation of an [a:] produced by one 
female patient with laryngeal dystonia. The observed 
contours in the F1 frequency and bandwidth can be 
interpreted with reference to the current status of 
patients’ typical phonation behaviour before and after 
botulinum toxin (BTX) treatment. The temporally 
highly resolved formant frequency and bandwidth 
contours reflect glottal features such as the different 
durations of the open phase and fundamental frequency 
and/or amplitude perturbations of the vocal fold 
vibration. For example, diplophonia is identifiable in 
parts of the phonation. These results suggest the 
possibility of quantifying differences in the intra-cycle 
first formant contours according to different voice 
qualities. 

Keywords: Linear prediction, electroglottography, 
laryngeal dystonia, BTX-treatment 


I. INTRODUCTION 


In voiced speech the larynx produces a complex sound 
that excites the resonances of the vocal tract. Interpreting 
this situation in terms of narrowband spectral analysis, 
the source signal consists of the fundamental oscillation 
and higher harmonics. The harmonics which are close to 
the resonance frequency of the vocal tract get amplified. 
They convey information on the place and manner of 
articulation to the listener whereas the fine structure of 
the harmonic spectrum carries the voice quality. In this 
analysis the long window of the narrowband spectrum 
looks at the formants as slowly varying characteristics of 
the vocal tract. Spectral gradient measurements may be 
associated with voice quality to a certain extent [1], but 
the minimum vowel duration of about 50 milliseconds 
together with the necessary stationarity excludes many 
short vowels and diphthongs from analysis with this 
technique. 

On the other hand it is evident that the vocal tract 
changes in the degree of coupling to the subglottal cavity 
with the movement of the vocal folds during the glottal 
cycle. Widely opened vocal folds couple the subglottal 
cavity to the vocal tract, which lowers at least the lowest (first) formant 
and increases its bandwidth. In contrast, the closed glottis 
acoustically decouples the subglottal cavity and leads to 
the highest frequency and the smallest bandwidth of the 
first formant. Immediately after the acoustic excitation by 
the completed glottal closure, the acoustic energy in the 
vocal tract is at its peak and this high-frequency low- 
bandwidth formant state is radiated most prominently. 

The aim of the present study is to visualize the glottal 
excitation for phonation of an [a:] produced by one 
female patient with laryngeal dystonia (spasmodic 
dysphonia, adductor type) before and after BTX- 
treatment. The common adductor type of laryngeal 
dystonia is characterised by irregular hyperadduction of 
the vocal folds leading to a strangled and hoarse voice 
quality with breaks in pitch and phonation [2]. 
Accordingly, the acoustic output is characterised by 
amplitude and frequency perturbation and diplophonia in 
parts of the signal. Diplophonia is a beat frequency 
phenomenon caused by the independent vibration of the 
two vocal folds or different parts of the vocal folds at 
different frequencies. 

In the present study, instantaneous estimates of the first 
formant's frequency and bandwidth are shown to follow 
the electroglottographic contour closely. The 
instantaneous frequency and bandwidth of the first 
formant are estimated by high temporal resolution 
linear prediction [3]. 


II. METHODS 


A. Signal Analysis 

Sustained vowels are considered here to be produced 
by a rapidly time varying system. The moving vocal folds 
themselves are the main source of this time variation. 
Their movements affect the acoustic termination of the 
vocal tract tube at its lower end. The aim of the analysis 
procedure is to obtain formant frequency and bandwidth 
estimates that closely track the changes of the acoustic 
resonator. Our approach is to use linear prediction within 
short signal frames of 3 ms that do not deviate too much 
from the stationarity requirement of linear prediction. The 
uncertainty of the formant parameter estimates increases 
with shortened frame duration. It has already been 
demonstrated that this formant parameter uncertainty may 
be counteracted by polynomial regression prior to, and 
temporal smoothing posterior to the linear prediction 
[3,4]. 

The processing starts with a standard first-order 
preemphasis filter with its zero at 0.99. Since the aim is to 
calculate the formant oscillation parameters, the next 
processing step suppresses the fundamental 
waveform to a great extent with a linear high-pass filter at 
400 Hz. The passband starts above the fundamental 
frequency of 250 Hz and well below the classically 
expected frequency of the first formant of an open vowel, 
around 800 Hz. The passband is terminated at 2 kHz to 
limit the influence of higher formants and high-frequency 
noise. This choice includes both the first and second 
formant of [a:] in the analysis, and it produces better 
results than the other tested alternatives: a lowpass filter 
cutoff at 1.2 kHz, 1.5 kHz or 3 kHz. 
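
The first processing step can be sketched as follows. A first-order preemphasis filter with its zero at 0.99 corresponds to the difference equation y[n] = x[n] - 0.99 x[n-1]. The band-pass stage is not reproduced here because its design (type, order) is not specified in the text. The function names are hypothetical, and keeping the first sample unchanged is an assumption about the boundary handling.

```python
import cmath
import math


def preemphasis(signal, zero=0.99):
    """First-order FIR preemphasis: y[n] = x[n] - zero * x[n-1]."""
    if not signal:
        return []
    out = [signal[0]]  # no predecessor for the first sample (assumed handling)
    for prev, cur in zip(signal, signal[1:]):
        out.append(cur - zero * prev)
    return out


def preemphasis_gain(freq_hz, fs, zero=0.99):
    """Magnitude response |1 - zero * exp(-j*2*pi*f/fs)| of the filter above."""
    return abs(1 - zero * cmath.exp(-2j * math.pi * freq_hz / fs))


FS = 50_000.0                                 # sampling rate used in the study
dc_gain = preemphasis_gain(0.0, FS)           # constant offsets are almost removed
nyquist_gain = preemphasis_gain(FS / 2, FS)   # the highest frequencies are boosted
flat = preemphasis([1.0, 1.0, 1.0])
```

The gain rises monotonically from 0.01 at 0 Hz to 1.99 at the Nyquist frequency, which is why the filter flattens the falling spectral slope of voiced speech before linear prediction.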

The polynomial regression consists of subtracting a 
best-matching constant, straight line or parabola from the 
signal in the analysed frame. This step is introduced to 
eliminate residual portions of the fundamental wave 
contour. Figs. 1 and 2 include only the subtraction of the 
mean, and figs. 3 and 4 the subtraction of the best-matching 
parabola. 
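
The detrending step can be illustrated with a small least-squares fit. The sketch below subtracts the best-fitting polynomial of degree 0 (constant), 1 (straight line) or 2 (parabola) from one frame. The function name is hypothetical, and solving the normal equations on a centred time axis is simply one convenient way to obtain the fit; the original implementation is not described in the text.

```python
def remove_polynomial(frame, degree):
    """Subtract the least-squares polynomial of the given degree (0, 1 or 2)."""
    n, m = len(frame), degree + 1
    if degree not in (0, 1, 2) or n < m:
        raise ValueError("degree must be 0, 1 or 2 and the frame long enough")
    ts = [i - (n - 1) / 2 for i in range(n)]  # centred time axis
    # Normal equations A c = b with A[r][c] = sum t^(r+c) and b[r] = sum t^r * x.
    a = [[sum(t ** (r + c) for t in ts) for c in range(m)] for r in range(m)]
    b = [sum((t ** r) * x for t, x in zip(ts, frame)) for r in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coef = [0.0] * m
    for r in reversed(range(m)):
        s = b[r] - sum(a[r][c] * coef[c] for c in range(r + 1, m))
        coef[r] = s / a[r][r]
    return [x - sum(cf * t ** k for k, cf in enumerate(coef))
            for t, x in zip(ts, frame)]


# A 150-sample frame (3 ms at 50 kHz) containing only a slow parabolic drift.
drift = [0.5 * i * i - 3.0 * i + 2.0 for i in range(150)]
residual = remove_polynomial(drift, 2)
```

For a frame that contains only a slow parabolic drift, the residual after a degree-2 fit is numerically zero, so whatever remains of the fundamental within a 3 ms frame is removed before linear prediction.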

The order of the linear prediction is selected to be 51, 
which corresponds roughly to one pole per kilohertz of the 
total bandwidth of the digitized signal plus one pole on 
the real axis. The analysis frames are shifted by 200 
microseconds, yielding 5 frames per millisecond. 
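
The text does not state how the formant frequency and bandwidth are read off the predictor. A common approach, assumed here, is to take the roots of the prediction polynomial and convert each complex pole to a resonance frequency and a 3 dB bandwidth with the standard relations F = angle * fs / (2 pi) and B = -ln|z| * fs / pi. The sketch also spells out the frame arithmetic given above; all names are hypothetical.

```python
import cmath
import math

FS = 50_000                          # sampling rate in Hz (from the text)
FRAME_LEN = round(3e-3 * FS)         # 3 ms analysis frame, in samples
HOP = round(200e-6 * FS)             # 200 microsecond frame shift, in samples
FRAMES_PER_MS = FS / HOP / 1000      # analysis frames per millisecond


def pole_to_formant(pole, fs):
    """Frequency (Hz) and 3 dB bandwidth (Hz) of the resonance of a z-plane pole."""
    freq = abs(cmath.phase(pole)) * fs / (2 * math.pi)
    bandwidth = -math.log(abs(pole)) * fs / math.pi
    return freq, bandwidth


# Round trip: build the pole of an 800 Hz resonance with 100 Hz bandwidth.
pole = cmath.rect(math.exp(-math.pi * 100.0 / FS), 2 * math.pi * 800.0 / FS)
f1, b1 = pole_to_formant(pole, FS)
```

A pole close to the unit circle therefore corresponds to a narrow bandwidth (closed glottis, low losses), and a pole further inside to a wide bandwidth (open glottis, stronger subglottal coupling).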

The formant parameters scatter broadly around the 
time-varying changes of the resonator. To visualize the 
formant track without noise, every 7 successive formant 
parameters are averaged and shown in the plots. This 
temporal smoothing results in a time resolution of 1.4 ms. 
In order to ignore outliers and to reduce mixing of the first and 
the second formant, only parameter estimates within the 
ranges 400 Hz < F1 < 1.2 kHz and 50 Hz < B1 < 600 Hz are 
averaged. 
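
The smoothing rule can be sketched as below. Two points are assumptions, since the text leaves them open: the blocks of 7 estimates are taken as non-overlapping (7 x 0.2 ms = 1.4 ms), and a frame is discarded when either its frequency or its bandwidth falls outside the stated range. The function name and the use of None for empty blocks are likewise hypothetical.

```python
def smooth_track(f1, b1, block=7, f_range=(400.0, 1200.0), b_range=(50.0, 600.0)):
    """Average in-range (F1, B1) estimates over consecutive blocks of frames."""
    if len(f1) != len(b1):
        raise ValueError("f1 and b1 must have the same length")
    out = []
    for start in range(0, len(f1) - block + 1, block):
        kept = [
            (f, b)
            for f, b in zip(f1[start:start + block], b1[start:start + block])
            if f_range[0] < f < f_range[1] and b_range[0] < b < b_range[1]
        ]
        if kept:
            out.append((sum(f for f, _ in kept) / len(kept),
                        sum(b for _, b in kept) / len(kept)))
        else:
            out.append(None)  # no plausible estimate in this block
    return out


# One block with a single outlier that looks like a higher formant.
track = smooth_track([800.0] * 6 + [3000.0], [100.0] * 7)
empty = smooth_track([3000.0] * 7, [100.0] * 7)
```

The outlier at 3000 Hz is ignored, so the block average stays at 800 Hz instead of being pulled upwards by a value that more plausibly belongs to a higher formant.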


B. Speech material 

One female patient with adductor spasmodic dysphonia 
was asked to produce the vowel [a:] at a normal pitch. 
Electroglottogram (EGG) and microphone signals were 
recorded simultaneously, and both were digitised with a 
sampling rate of 50kHz and 16-bit amplitude resolution. 
The microphone signal was recorded using a headset 
condenser microphone (NEM 192.15, Beyerdynamic). By 
using a headset microphone, the distance to the lips 
remains constant during speech, independent of head 
movements [5]. The EGG signal was measured with a 
Portable Laryngograph from Laryngograph Ltd. Both 
signals were fed directly into a Computerised Speech Lab 
(CSL) station (model 4300B). 


III. RESULTS 


Three contours show the different states of the 
speaker's voice quality: the electroglottographic 
measurement as a phonation reference, the instantaneous 
frequency estimate of the first formant (F1), and the 
bandwidth estimate of the first formant (B1). The 
analysed signals shown in figs. 1 - 4 represent the voice 
quality (a) before BTX-treatment, (b) five days after 
BTX-treatment, (c) two months after BTX-treatment and 
(d) six months after BTX-treatment, reflecting the state of 
relapse. 


A. Before BTX-treatment 

EGG and instantaneous formant measurements of 
hoarse voice quality are shown in fig. 1. The pitch cycles 
show, in parts, strong fundamental frequency and 
amplitude perturbation. Accordingly, the EGG contours 
show variation from cycle to cycle, indicating 
diplophonia. The F1 and B1 contours follow the EGG 
course less closely than in the normal voice quality (modal 
voice) shown in fig. 3. 

Fig. 1 Before BTX-treatment 


B. Five days after BTX-treatment 

The tendency of the F1 contour to follow the EGG 
contour is visible in fig. 2. Due to breathiness as a result 
of BTX-treatment, the open phase in the pitch cycle is 
very long. This long open phase is displayed by EGG and 
F1 frequency contours and by the high bandwidth of the 
first formant. 

Fig. 2 Five days after BTX-treatment 


C. Two months after BTX-treatment 

Two months after treatment, the contours are 
comparable to modal voice quality. This voice quality has 
also been described in a recent study [4]. In fig. 3, the 
modal voice contours are shown for EGG, F1 and B1. 
The beginning of the closing phase of each pitch cycle is 
displayed as an ascent of the EGG contour. The ascent 
ends in the contact phase. The locally maximal contact is 
marked by the upper peak. After a delay of about 2 
milliseconds the same upward and peak course is visible 
in the first formant frequency. In contrast to this, the 
bandwidth of the first formant is minimal during the 
closed phase, i.e. the peaks in EGG and F1 are aligned 
with a B1 valley in fig. 3. The low bandwidth indicates a 
low loss of acoustic energy in this phase of the pitch 
cycle when the subglottal cavity is minimally coupled to 
the supraglottal vocal tract. The beginning of the opening phase 
of the glottal cycle is characterised by a decreasing 
contact of the vocal fold tissue. The EGG contour, 
displaying the electrical conductivity across the larynx, 
falls and reaches its valley when the vocal folds are open. 
Again, after the acoustic delay the frequency of the first 
formant in fig. 3 decreases and reaches a valley. This is 
interpreted as a decreasing cavity resonance frequency 
due to an increasing acoustic coupling of the subglottal 
and supraglottal cavities. The first formant’s frequency is 
minimal and its bandwidth is maximal during the open 
phase. The increasingly large bandwidth corresponds to 
an increasingly large loss of acoustic energy in the 
subglottal cavity. 

Fig. 3 Two months after BTX-treatment 


D. Six months after BTX-treatment 

About six months after BTX-treatment, a state of 
relapse is observed (see fig. 4). As mentioned above (see 
section A.), the pitch cycles again show, in parts, strong 
fundamental frequency and amplitude perturbation. 
Equally, the EGG contours show some variation from 
cycle to cycle, again displaying diplophonia in parts of 
the signal. The F1 and B1 contours once more follow the 
EGG course less closely. 

Fig. 4 Six months after BTX-treatment 


IV. DISCUSSION 


First formant analysis with high temporal resolution is 
a promising candidate as a tool for the acoustic 
observation of changes in voice quality during treatment 
with BTX. This is confirmed by the observation of 
similarities between the electroglottographic contour and 
the frequency and bandwidth contours of the first 
formant. The observed contours can be interpreted with 
reference to the current status of patients’ typical 
phonation behaviour before and after BTX-treatment for 
adductor spasmodic dysphonia. For the pre-BTX- 
treatment phase, fundamental frequency and amplitude 
perturbation as well as diplophonia are apparent in parts 
of the vocal fold vibration. For the phase immediately 
after treatment, breathy voice quality with a long open 
phase and a large bandwidth of the first formant is 
observed. This bandwidth of the first formant indicates a 
higher loss of acoustic energy in the subglottal cavity. 
The cavity is more strongly coupled to the supraglottal 
vocal tract. For the phase about two months after BTX- 
treatment, modal voice quality can be noted. The 
regularity of the glottal cycle can be seen in the 
oscillation of the first formant frequency and its narrow 
bandwidth. Finally, in the so-called relapse phase, pitch 
cycles with strong fundamental frequency and amplitude 
perturbation as well as diplophonia in parts can be 
detected again. 


V. CONCLUSION 


The observations in the present study seem to reflect 
closely the (patho)physiological behaviour of vocal fold 
vibration caused by a merely symptomatic treatment of 
laryngeal dystonia. Therefore, they may help to determine 
the actual voice quality status during this treatment. For 
the future it may be interesting to quantify differences in 
the first formant contours according to different voice 
qualities. 


Abstract: The presentation concerns the estimation of 
the vocal tract length of a speaker on the basis of her 
formant frequencies and the formant frequencies and 
known tract length of a reference speaker. The length 
prediction is founded on a rule inferred from 
Webster’s equation, which describes the propagation of 
a planar acoustic wave in a loss-less vocal tract. The 
length prediction experiments have been cross-language, 
cross-gender and cross-corpora. Results 
show that the relative length prediction error is less 
than 3%, which is smaller than the error made when assuming 
typical tract lengths of 17 and 15 cm for male and 
female speakers respectively. 


I. INTRODUCTION 


This study is devoted to the estimation of the vocal tract 
length of a speaker by means of his formant frequencies 
and the formant frequencies and default tract length of a 
reference speaker. 

Several studies have been devoted to the topic of tract 
length estimation, because one may argue that the tract 
length is an anatomical cause of inter-speaker variability 
[2]. Possible applications of predicting tract lengths from 
acoustic data are speaker normalization and the 
facilitation of acoustic-to-articulatory inversion [1]. A 
majority of studies have focused on length normalization 
with a view to achieving speaker normalization, without 
attempting to estimate the tract length explicitly. 

The default length is the vocal tract length with lips 
and larynx in neutral positions. Lip rounding or spreading 
and larynx raising or lowering are phonetically relevant 
gestures that mark vowel timbre and which overlay a 
speaker’s anatomically conditioned default length. 

Several methods have been used to estimate the vocal 
tract length from speech. One is based on a known 
formula that relates the length of a uniform loss-less 
acoustic tube to its natural frequencies when the tube is 
open at one end and closed at the other. The vocal tract 
length is estimated by averaging several length values 
obtained by means of several observed formants [3], [4]. 
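
For a uniform loss-less tube closed at one end and open at the other, the natural frequencies are F_n = (2n - 1) c / (4 L), so each observed formant yields one length value L_n = (2n - 1) c / (4 F_n). The sketch below averages these values, as described above. The speed of sound c = 350 m/s and the function name are assumptions made for the example.

```python
def tube_length_from_formants(formants_hz, c=350.0):
    """Average length (m) of a uniform closed-open tube implied by F1, F2, ..."""
    if not formants_hz:
        raise ValueError("at least one formant is needed")
    lengths = [(2 * n - 1) * c / (4.0 * f)
               for n, f in enumerate(formants_hz, start=1)]
    return sum(lengths) / len(lengths)


# Formants of an ideal neutral vowel: 500, 1500 and 2500 Hz.
length_m = tube_length_from_formants([500.0, 1500.0, 2500.0])
```

With these idealized formants every term gives 0.175 m, i.e. a 17.5 cm tube. For real vowels the individual terms disagree, because the tract is not uniform, which is the limitation that the other methods discussed here try to avoid.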

Paige et al. have proposed to estimate the tract length 
using low-order poles and zeros of the lip impedance, 
omitting the assumption of uniform cross-sections [5]. 
The lip impedance poles correspond to the natural 
frequencies of the tract closed at both ends, which cannot 
be measured from the speech signal directly. Be that as it 
may, Paige’s approach has in common with [1] that it 
aims at obtaining length estimates on the basis of acoustic 
data only. 

The method we have investigated enables estimating 
the unknown vocal tract length of a speaker by means of 
his formant frequencies as well as the known vocal tract 
length and formant frequencies of a reference speaker. 
The experiments that have been carried out include 
predicting tract lengths across genders and linguistic 
communities. The focus has been on default tract lengths, 
because the general framework has been acoustic- 
articulatory inversion, which assumes the default lengths 
to be known and the deviations therefrom to be 
computable. 


II. METHODS 


The method is based on an observation, made by 
Ungeheuer, concerning Webster’s equation, which 
describes the propagation of planar loss-less acoustic 
waves in non-uniform ducts [7]. Webster’s equation 
suggests that when the longitudinal dimension of an 
acoustic tube is multiplied by a constant, its natural 
frequencies change in inverse proportion to that same 
constant. Applying this observation to the vocal tract 
would suggest that multiplying the length of the vocal 
tract by a number causes the formant frequencies to be 
divided by the same number. Mol, for instance, has tested 
this prediction by means of the Peterson and Barney data 
[8] by displaying the first and second formant averages 
for men, women and children in a chart and observing 
that the averages are positioned for each vowel on a 
straight line through the chart origin [6]. 


A. Estimation of the factor of proportionality 


Because the first three formants of all vowels are 
assumed to obey the rule of inverse proportionality, the 
multiplicative constant a, which is assumed to explain 
inter-speaker formant differences owing to default length 
differences, is calculated as follows. 
(1) 


Symbol F_R designates the formant frequencies of a 
set of vowels of a reference speaker whose average vocal 
tract length is known and F_T the formant frequencies of a 
set of vowels of the speaker whose vocal tract length is 
unknown. Symbol N equals the number of formants per 
vowel times the number of vowel categories. It is desirable 
that the vowel categories and the number N of formant 
frequencies are identical for the target and reference 
speakers, because of the vowel-typical vocal tract 
lengthening and shortening that must be averaged out 
when the goal is the estimation of the default length. 

Once the factor of proportionality, a, has been obtained, 
the unknown tract length L can be estimated via the 
known tract length L_R of the reference speaker. 


L = a L_R (2) 
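
Relations (1) and (2) can be illustrated with a short numerical sketch. Because the explicit form of relation (1) is not reproduced in this text, the sketch assumes that a is the mean of the per-formant ratios F_R/F_T, which is one plausible reading of the inverse-proportionality rule. The function names and the formant values are hypothetical and are not taken from the corpora. 

```python
from statistics import mean


def proportionality_factor(f_ref, f_target):
    """Estimate a as the mean ratio F_R / F_T.

    ASSUMPTION: this averaging form stands in for relation (1),
    whose exact expression is not available in the text.
    """
    if len(f_ref) != len(f_target) or not f_ref:
        raise ValueError("both speakers need the same non-zero number N of formants")
    return mean(fr / ft for fr, ft in zip(f_ref, f_target))


def estimate_length(a, l_ref):
    """Relation (2): L = a * L_R."""
    return a * l_ref


# Hypothetical reference formants in Hz (two vowels, three formants each, N = 6).
f_ref = [300.0, 2300.0, 3000.0, 700.0, 1200.0, 2500.0]

# Synthetic target speaker whose tract is 0.9 times as long as the reference:
# every formant is divided by 0.9, as the inverse-proportionality rule predicts.
true_scale = 0.9
f_target = [f / true_scale for f in f_ref]

a = proportionality_factor(f_ref, f_target)
length = estimate_length(a, l_ref=18.0)  # hypothetical reference length in cm
print(f"a = {a:.3f}, estimated length = {length:.2f} cm")
```

With real data the per-formant ratios scatter around a, which is why identical vowel categories and the same N for both speakers are desirable. 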


B. Corpora 


Corpora are divided into reference and test corpora. 
The first and the second reference corpora comprise the 
vocal tract lengths and first three formant frequencies of 
10 French vowels, each sustained by 4 speakers (2 males 
and 2 females) [9] and by one male speaker [10], 
respectively. Hereafter, these speakers are labeled MS1, 
MS2, FS1, FS2 and MS3. 

A third reference corpus comprises the tract lengths 
and first three formant frequencies for 10 American- 
English vowels sustained by one female speaker [11] 
(labeled FS_AE). 

The formant frequency data published in the 
framework of these corpora have been obtained via 
measured vocal tract cross-sections and lengths combined 
with acoustic models. The purpose has been to guarantee 
the best possible match between published acoustic and 
morphological data. 

This is, however, problematic when the objective is to 
test relations (1) and (2) because for these corpora the 
formant frequency data cannot be assumed to be 
independent of the model the predictions of which they 
are expected to validate. 

Therefore, only those corpora have been retained as 
test corpora for which the formant frequencies have been 
determined from the speech spectra directly, independently of 
any Webster’s equation-based modeling. One test corpus 
comprises the tract lengths and formant frequencies 
measured for one male speaker who has sustained 10 
American-English vowels [11]. The second test corpus 
comprises the vocal tract lengths and three formant 
frequencies of 5 Russian vowels produced by one male 
speaker [12]. The American English and Russian 
speakers are hereafter labeled MS_AE and MS_R 
respectively. 

The area functions and lengths published in [9] and 
[11] have been obtained by magnetic resonance imaging. 
The cross-sections and lengths in [12] are the well-known 
Russian vowel data published by G. Fant. They have 
been compiled on the basis of X-ray images. The shapes 
and lengths published in [10] have been recorded by a 
combination of phonetic a priori knowledge, visual 
inspection of human speakers and X-ray imaging. 

The default length for each speaker has been obtained 
by averaging the vowel-typical lengths. 


III. RESULTS 


A. Experiment 1 


The experiment consists in predicting the vocal tract 
length of American-English test speaker MS_AE by means 
of each reference speaker in turn. Table 1 shows the 
length prediction results. One sees that the maximum 
absolute relative error is less than 2 %. 


Table 1: Relative error in % and proportionality factors a 
obtained for American-English male test speaker MS_AE. 
Symbol L is the measured default length. 


Test: MS_AE, L = 17.14 cm 

Reference    a       Relative error (%) 
MS1          0.93    -0.15 
MS2          0.93    -0.54 
FS1          1.08    -1.77 
FS2          1.03     0.91 
MS3          0.96    -0.91 
FS_AE        1.25    -0.06 


B. Experiment 2 


The experiment consists in predicting the vocal tract 
length of Russian test speaker MS_R by means of each 
reference speaker in turn. This experiment has involved 
five of Fant’s Russian vowels [12]. The number of 
vowels has been the same for all speakers. The Russian 
and reference vowel qualities have been chosen to be as 
similar as possible. Table 2 shows the length prediction 
results. One sees that the maximum absolute relative 
error is less than 2.5 %. 


Table 2: Relative error in % and proportionality factors a 
obtained for Russian male test speaker MS_R. Symbol L is 
the measured default length. 


Test: MS_R, L = 17.6 cm 

Reference    a       Relative error (%) 
MS1          0.95    -1.12 
MS2          0.96     0.13 
FS1          1.13    -2.41 
FS2          1.08    -0.29 
MS3          1.01    -0.96 
FS_AE        1.27    -0.47 


C. Experiment 3 


The experiment involves speakers MS_R and MS_AE as test 
and reference speakers respectively. The 
proportionality factor a is then equal to 1.04 and the relative 
error to -1.18 %. Inverting the roles of speakers 
MS_R and MS_AE gives rise to the same relative error in 
absolute value because relation (2) shows that estimating 
one length from another and vice versa means replacing 
constant a by 1/a and the relative error by its negative. 
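
The role inversion can be written out explicitly. The short derivation below is a sketch under two assumptions that are not stated in the text: the relative error is taken as e = (L - a L_R)/L, and the inverted estimate uses exactly 1/a. 

```latex
% Forward estimate (test length L, reference length L_R):
e = \frac{L - a L_R}{L}
\quad\Longrightarrow\quad a L_R = L\,(1 - e).

% Inverted roles (test length L_R, reference length L, factor 1/a):
e' = \frac{L_R - L/a}{L_R}
   = -\frac{L - a L_R}{a L_R}
   = -\frac{e}{1 - e}
   \approx -e \quad \text{for } |e| \ll 1 .

% With e = -1.18\,\%: \quad e' = \frac{0.0118}{1.0118} \approx +1.17\,\%.
```

Under these assumptions the two errors have opposite signs and their magnitudes agree to first order in e. 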


D. Experiment 4 


This experiment has been carried out with the six 
speakers originally assigned to the reference corpora. 
Within this experiment, each speaker has been given in 
turn the role of “reference” speaker from whom the 
lengths of the other five speakers are predicted. Table 3 
reports the proportionality factors a above the main 
diagonal and the relative error in percent below the main 
diagonal. The row labels refer to “reference” and the 
column labels to “test” speakers. In Table 3, the 
maximum absolute relative error is less than 3%, whichever 
speaker serves as “reference”. 

One should keep in mind that predicting the lengths 
of speakers belonging to these corpora is a necessary, but 
not sufficient, test. The reason is that for these speakers 
the formant frequencies have not been obtained 
independently of Webster’s equation, of which relation (2) 
is a consequence. 


Table 3: Relative error in % (below the diagonal) and 
proportionality factors a (above the diagonal) for 6 
speakers [9,10,11], each taking the role of “reference” 
speaker in turn. Rows are “reference” speakers and columns 
are “test” speakers. Symbol L is the measured default length 
in cm. 


Ref \ Test   MS1       MS2       MS3       FS1       FS2       FS_AE 
             L=18.54   L=18.24   L=18      L=16.01   L=16.42   L=13.73 
MS1                    0.99      0.96      0.85      0.90      0.74 
MS2          0.69                0.97      0.86      0.90      0.75 
MS3          -0.76     -1.46               0.88      0.93      0.77 
FS1          -1.62     -2.33     -0.86               1.05      0.87 
FS2          1.06      -0.38     1.81      2.64                0.83 
FS_AE        0.09      -0.60     0.85      1.68      -0.98 


E. Comparison with standard assumptions 


Often one assumes that the standard vocal tract length for 
men is 17 cm and for women 15 cm. One question is 
whether predicting the tract lengths by means of formant 
frequencies and the data of a reference speaker causes 
relative errors that are smaller than those that would have 
been obtained by making the above default assumptions. 
One sees in Table 4 that these assumptions cause relative 
errors between -9.2% and +8.6%. The (absolute) average 
is 6.1%, which must be compared to the average of 
0.72% of Table 1 and 0.90% of Table 2. Table 4 therefore 
suggests favouring length prediction over length 
standardization via default values. 
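
The comparison can be checked with a few lines of arithmetic. The sketch below assumes that the relative error is defined as (L_OBS - L_STD)/L_OBS, which reproduces the range and the average quoted above; the speaker labels follow the reading MS1, MS2, MS3, FS1, FS2, MS_AE, FS_AE and MS_R, and the lengths are the measured default lengths reported in Tables 1 to 3. 

```python
# Measured default lengths in cm and speaker gender ("m" or "f").
# The labels are a reading of the partly illegible speaker names.
speakers = {
    "MS1": (18.54, "m"),
    "MS2": (18.24, "m"),
    "MS3": (18.00, "m"),
    "FS1": (16.01, "f"),
    "FS2": (16.42, "f"),
    "MS_AE": (17.14, "m"),
    "FS_AE": (13.73, "f"),
    "MS_R": (17.60, "m"),
}

# Standard lengths in cm assumed for men and women.
STANDARD_LENGTH = {"m": 17.0, "f": 15.0}


def relative_error_percent(l_obs, l_std):
    """Relative error in %, ASSUMED to be (L_OBS - L_STD) / L_OBS."""
    return 100.0 * (l_obs - l_std) / l_obs


errors = {
    name: relative_error_percent(length, STANDARD_LENGTH[gender])
    for name, (length, gender) in speakers.items()
}
mean_abs_error = sum(abs(e) for e in errors.values()) / len(errors)

for name, err in errors.items():
    print(f"{name:6s} {err:6.2f} %")
print(f"mean absolute error: {mean_abs_error:.1f} %")
```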


Table 4: Relative length error in % using standard tract 
lengths of 17 and 15 cm for males and females 
respectively. Symbols L_OBS and L_STD designate the 
observed length and the standard length respectively in 
cm. Symbol E designates the relative error in %. 


         MS1     MS2     MS3     FS1     FS2     MS_AE   FS_AE   MS_R 
L_OBS    18.54   18.24   18      16.01   16.42   17.14   13.73   17.6 
L_STD    17      17      17      15      15      17      15      17 
E        8.31    6.80    5.56    6.31    8.65    0.82    -9.25   3.41 


F. Correlation between factors of proportionality a and 
length prediction errors 


Relation (2) is applicable to arbitrary test and reference 
lengths, whatever the length difference. Relation (2) 
therefore predicts that no correlation is expected between 
calculated length errors and constants of proportionality 
a. For the grouped Experiments 1 and 2 and for 
Experiment 4, the correlations between calculated length 
errors and factors of proportionality are -0.2494 and 
-0.1493 respectively. These are not statistically 
significant. 
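
The lack of significance can be illustrated with the usual t-test for a Pearson correlation coefficient. The sample sizes are assumptions, since they are not stated explicitly: n = 12 for the grouped Experiments 1 and 2 (six reference speakers for each of the two test speakers) and n = 15 for Experiment 4 (the fifteen speaker pairs of Table 3). The critical values are the standard two-sided 5 % values of Student's t. 

```python
import math


def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2) for a Pearson correlation r over n pairs."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Two-sided 5 % critical values of Student's t, indexed by degrees of freedom.
T_CRITICAL = {10: 2.228, 13: 2.160}

# (label, reported correlation, ASSUMED number of pairs)
cases = [
    ("Experiments 1 and 2", -0.2494, 12),
    ("Experiment 4", -0.1493, 15),
]

results = {}
for label, r, n in cases:
    t = t_statistic(r, n)
    significant = abs(t) > T_CRITICAL[n - 2]
    results[label] = (t, significant)
    print(f"{label}: t = {t:.3f}, significant at 5 %: {significant}")
```

Under these assumed sample sizes both t values stay well below the critical values, in line with the statement that the correlations are not statistically significant. 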


IV. DISCUSSION AND CONCLUSION 


a) Results suggest that estimating unknown tract lengths 
via measured formant frequencies and a reference tract 
length is a valid method that causes errors that are smaller 
than those made by assigning standard lengths to male 
and female tracts. 

b) The different experiments have involved cross-linguistic 
and cross-gender length predictions. The results 
suggest that these cross-factor predictions do not cause 
length estimation errors to be larger than within-factor 
predictions. A possible explanation is that observed errors 
are the combined effect of measurement errors 
(morphological and acoustic), the disparity between the 
recording conditions of acoustic and length data (which 
may not have been simultaneous) as well as the 
disagreement between predicted and recorded data, and 
that these combined errors are larger than the average 
errors caused by cross-linguistic vowel category or 
gender mismatch. 

c) Length estimation errors and factors of proportionality 
a are not statistically significantly correlated. This is an 
indirect test of the validity of relation (2). Indeed, if 
relation (2) were a crude approximation only of an 
unknown relation between the vocal tract lengths of two 
speakers, one would expect to observe increasing length 
estimation errors with increasing factors of 
proportionality. The reason is that linear relation (2) is 
then expected to approximate that link the better the 
smaller the difference between the reference and test tract 
lengths. The lack of observed correlations suggests, 
however, that identity (2) is a valid approximation of the 
relation between the default vocal tract lengths of two 
speakers, whatever the difference in vocal tract size, as 
long as up to three formants are involved in the 
comparison. 


Abstract: The study deals with mathematical modeling 
of the vocal fold self-oscillations related to estimation 
of the so-called output-cost-ratio (OCR), which is 
computed from the numerically simulated sound 
pressure level at the glottal level and the impact stress 
(IS) during vocal folds collision. The dependence of 
OCR on prephonatory glottal width, fundamental 
frequency and lung pressure is discussed and partly 
compared with a modified output cost ratio measured 
in humans, where the closed quotient is used instead 
of IS. 

Key words: Biomechanics of voice, numerical 
simulation of vocal folds vibration. 


I. INTRODUCTION 


Impact stress (IS, i.e. the impact force divided by the 
contact area) has been regarded as the main loading factor 
in voice production and the most plausible cause of vocal 
fold traumas like nodules. To quantify the cost of voice 
production, Berry et al. [1,2] presented a parameter called 
output-cost-ratio (OCR), which concerns the acoustic 
output in relation to IS: 


OCR = 20 log(P_ac / P_0) - 20 log(IS / IS_0) , (1) 


where P_ac is supraglottal acoustic sound pressure 
(measured at a distance of 15 cm above the glottis of an 
excised canine larynx), and P_0 and IS_0 are constants. 

IS is difficult to measure directly in humans. The 
present study investigates the output-cost-ratio using an 
aeroelastic model of voice production. The aeroelastic 
model of vocal fold vibration makes it possible to study the 
output-cost-ratio OCR in more detail than in the 
experiments with the excised larynges. The influence of 
various parameters on OCR can be studied separately and 
in a more controllable way. 

It has been found that closed quotient (CQ, i.e. closed 
time of the glottis divided by the period length) obtained 
from the electroglottographic (EGG) signal correlates with IS 
(see Verdolini et al. [9]). Laukkanen et al. [7] have tested 
in human subjects the so-called Quasi-Output-Cost ratio, 
where CQ has been used instead of IS. 

The present study compares results of OCR obtained 
with modelling to some of the results obtained for human 
subjects by Laukkanen et al. [7]. 


II. METHOD 


IS magnitudes and sound pressure level (SPL_source) 
above the glottis were quantified using an aeroelastic 
computer model of the vocal fold self-oscillations 
employing the Hertz model of impact forces during vocal 
fold collision - see [4,5]. The model is based on a two- 
degrees-of-freedom dynamic system allowing rotation 
and translation of the vocal-fold-shaped element vibrating 
on two springs and dampers - see Fig. 1. Self-oscillations 
are excited by nonlinear aerodynamic forces resulting 
from the fluid-structure interaction. 


The impact Hertz force is given as F_H = k_H δ^(3/2), 
where k_H is the contact stiffness and δ is the penetration 
of the vocal fold through the symmetry axis during 
collision. IS was calculated as the maximum value during 
one oscillation period according to the formula: 


IS = (3 / (2π)) F_H,max^(1/3) [4E / (3 r (1 - ν^2))]^(2/3) , (2) 


where F_H,max = k_H δ_max^(3/2), k_H = 4E √r / (3 (1 - ν^2)), r is the 
radius of the curvature of the vocal fold model at the 
contact point, E is Young modulus and ν is Poisson 
number; for E = 8000 Pa, ν = 0.4. A parabolic shape of 
the vocal fold surface was considered, which gives the 
radius r. For the on-line numerical simulations in time 
domain, the resulting system of four 1st-order ordinary 
differential equations describing the vocal fold vibrations 
was solved by the 4th-order Runge-Kutta method. 
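
The integration scheme can be sketched generically. The example below applies one classical 4th-order Runge-Kutta step to a toy undamped oscillator written as two 1st-order equations; this oscillator is a hypothetical stand-in and not the four-equation aeroelastic vocal fold system of the study, whose right-hand side is not given here. 

```python
import math


def rk4_step(f, t, y, h):
    """Advance the system y' = f(t, y) by one classical 4th-order Runge-Kutta step."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [
        yi + h / 6 * (a + 2 * b + 2 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


def toy_oscillator(t, y):
    """Hypothetical test system x'' = -x, written as x' = v, v' = -x."""
    x, v = y
    return [v, -x]


# Integrate from t = 0 to t = 1 with x(0) = 1, v(0) = 0; the exact solution is cos(t).
t, y, h = 0.0, [1.0, 0.0], 0.01
for _ in range(100):
    y = rk4_step(toy_oscillator, t, y, h)
    t += h

print(f"x(1) = {y[0]:.8f}, exact = {math.cos(1.0):.8f}")
```

A model with collisions would add the Hertz force F_H = k_H δ^(3/2) to the right-hand side whenever the penetration δ is positive. 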


Figure 1. Schema of the aeroelastic model - [4] 


In calculating OCR with the model, values of P_0 = 
20 µPa and IS_0 = 1 Pa were used. Prephonatory glottal 
half-width was set as g = 0.2 to 0.5 mm, i.e. the glottal 
width varied between 0.4 and 1 mm. Fundamental 
frequency F0 was set to 100 and 400 Hz. Using the 
model presented here, it was not possible to use negative 
pre-phonatory glottal width (corresponding to pressed 
phonation), which Berry et al. [1,2] also used. In the 
present study, the lung pressure (P_lung) and airflow values 
were set within the range reported for healthy humans 
(P_lung < 3000 Pa, airflow rate Q < 0.8 l/s, see Hirano [3]). 
The computations were realized in the range of P_lung from 
the phonation threshold pressure (P_th) to the phonation 
instability pressure (PIP). 
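
Relation (1) with these reference constants can be evaluated as follows. This is a minimal sketch: the function name and the example input values are hypothetical, and base-10 logarithms are assumed, as is usual for level quantities in dB. 

```python
import math

P0 = 20e-6  # reference sound pressure in Pa (20 µPa)
IS0 = 1.0   # reference impact stress in Pa


def output_cost_ratio(sound_pressure, impact_stress, p0=P0, is0=IS0):
    """OCR in dB according to relation (1), using base-10 logarithms."""
    if sound_pressure <= 0 or impact_stress <= 0:
        raise ValueError("sound pressure and impact stress must be positive")
    return (20.0 * math.log10(sound_pressure / p0)
            - 20.0 * math.log10(impact_stress / is0))


# Hypothetical values: 2 Pa acoustic pressure (100 dB SPL) and 1000 Pa impact stress.
example = output_cost_ratio(2.0, 1000.0)
print(f"OCR = {example:.1f} dB")
```

The check for positive arguments reflects the behaviour of the definition: as the impact stress approaches zero, the ratio grows without bound. 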

In measurements, the data were obtained from human 
subjects (see [7]). The subjects were 62 females 
producing [pa:p:a] 5 times loudly. The sound pressure 
level (SPL) was registered at 40 cm from the subject’s 
lips, and the closed quotient CQ_EGG was calculated from the EGG 
signal. The acoustic signal was recorded using a digital 
recorder and B&K 4164 microphone, and the EGG signal 
was registered with Glottal Enterprises dual-channel 
EGG. Oral pressure was registered with MSIF-II (Glottal 
Enterprises). The oral pressure during voiceless plosive 
[p] was used as an estimate of subglottic pressure. The 
acoustic signal was analyzed for mean F0 and SPL using 
Intelligent Speech Analyser (ISA) signal analysis device 
(developed by Raimo Toivonen, M.Sc. Eng). CQ_EGG, 
vibration period T (F0 = 1/T) and the mean oral pressure 
during [p] were measured by using a custom-made 
program for measurement of AC and DC signals 
(developed by Heikki Alatalo, DSP-Systems). 


III. RESULTS AND DISCUSSION 


Figure 2 shows the simulated SPL_source values, at the 
upper end (x=L) of the glottis, for all considered 
prephonatory glottal half-widths g as a function of lung 
pressure, which is presented as a dimensionless 
normalized excess subglottal pressure 


P_sen = (P_lung - P_th) / P_th (3) 


As expected, after crossing the phonation onset at the 
phonation threshold pressure P_th, the SPL_source increases 
with the pressure P_sen for all g values in a nearly linear 
way. The highest SPL_source values are reached for 
g=0.5 mm near the PIP, where the lung pressure values 
are at a maximum. 

The IS values obtained with the model (see Fig. 3) are 
in the range of the data reported for living subjects and 
excised human and canine hemilarynges (see [1,6,9]). IS 
increased with the lung pressure, reaching a plateau when 
getting close to the PIP values. Again, the maximum 
values of IS were obtained for g=0.5 mm near PIP, where 
the lung pressure maximum also occurs. Nearly zero IS 
values are found near P_th, i.e. near P_sen = 0. 

The OCR calculated according to equation (1) from 
the simulated SPL_source and IS values is shown in Fig. 4. 


Figure 2. Computed SPL_source (dB) versus normalized excess 
subglottal pressure P_sen for g = 0.2, 0.3, 0.4 and 0.5 mm 
(F0=100 Hz). 


Figure 3. Computed IS versus P_sen. (F0=100 Hz). 


The maximum of OCR appears near P_sen = 0 due to the 
very low IS values near the phonation threshold. For all 
prephonatory glottal half-widths g the OCR decreases 
with P_sen, having minimal values at about P_sen ≈ 1.5; 
thereafter the OCR values slightly increase up to the PIP 
values, where the IS reaches a plateau, while SPL_source 
still increases (compare Fig. 4 with Figs. 2,3). We can 
note that according to the model and the definition (1) of 
the OCR parameter, the most advantageous (economic) 
regime would be to phonate near the phonation onset. It 
seems to be a peculiar but trivial and expected result, 
because at P_th there are none or very small impacts 
(IS → 0) and therefore OCR theoretically goes to 
infinity (OCR → +∞). 


Figure 4. Computed OCR versus P_sen. (F0=100 Hz). 


With the present settings, the dependence of the OCR 
values on the prephonatory glottal width g differed 
qualitatively according to P_lung (see Fig. 5). 
According to the results by Berry et al. [1] for 
F0=150 Hz, a prephonatory glottal width of 2 mm was 
optimal (gave the largest SPL with the lowest IS) in 
excised canine larynges, while with their model of the 
vocal folds with a vocal tract, a width of 1 mm was 
optimal (OCR values reached the maximum). Later Berry 
et al. [2] reported a broad maximum in the OCR curves at 
about 0.6 mm for excised canine larynges when the 
subglottal pressure was varied in the range 1 to 1.6 kPa. 
The results of the present 
study suggest that the optimal glottal width is dependent 
on the lung pressure (see Figs. 4 and 5). At low P_lung 
values a larger prephonatory glottal width seems to be 
more economic, while at high P_lung values a smaller 
width would be preferable. It should be noted, 
however, that using the present aeroelastic model, 
phonation with really small glottal widths (corresponding 
to pressed phonation) was not possible to model. 


Figure 5. Calculated output cost ratio OCR versus 
prephonatory glottal half-width g for F0=100 Hz and 
P_lung = 400 and 900 Pa. 


Because IS is difficult to measure in humans, CQ may 
be used as a substitute for it, based on the fact that there 
is a correlation between CQ and IS reported in excised 
canine larynges [9]. The relation between IS and CQ 
obtained with the model of the present study is shown in 
Fig. 6. In general, we can suppose the relation in the 
following form: 

IS = a CQ^b , (4) 


where a and b are constants dependent on g as shown in 
Fig. 6. The exponent varied from b=1.2 to 3.7 in 
dependence on the prephonatory glottal half-width. 

After substituting IS from equation (4) into formula 
(1), the OCR can be approximated by a Modified Output 
Cost Ratio parameter defined as 


MOCR = SPL_source - 20*2*log CQ + const., (5) 


where the constant b in equation (4) was approximated by 
the value b=2 for all prephonatory glottal half-widths 
considered. The computed MOCR is shown as function of 
the normalized subglottal pressure in Fig. 7. 
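
Both steps, the power-law relation (4) and the modified ratio (5), can be sketched in a few lines. The fit below is an ordinary least-squares line in log-log coordinates, which is one common way to estimate a and b and is not necessarily the procedure used in the study. The data are synthetic, the function names are hypothetical, and the additive constant of relation (5) is set to zero. 

```python
import math


def fit_power_law(cq_values, is_values):
    """Least-squares fit of IS = a * CQ**b in log-log coordinates."""
    xs = [math.log10(c) for c in cq_values]
    ys = [math.log10(s) for s in is_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = 10.0 ** (mean_y - b * mean_x)
    return a, b


def mocr(spl_source_db, cq, b=2.0):
    """Relation (5) with the additive constant omitted: SPL_source - 20*b*log10(CQ)."""
    return spl_source_db - 20.0 * b * math.log10(cq)


# Synthetic data generated from IS = 3000 * CQ^2 (values are illustrative only).
cq_values = [0.2, 0.3, 0.4, 0.5, 0.6]
is_values = [3000.0 * c ** 2 for c in cq_values]

a, b = fit_power_law(cq_values, is_values)
print(f"a = {a:.1f}, b = {b:.2f}")
print(f"MOCR at 100 dB and CQ = 0.5: {mocr(100.0, 0.5):.2f} dB")
```

Because CQ is smaller than one, its logarithm is negative, so a smaller closed quotient raises MOCR at a given SPL_source. 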

The MOCR parameter calculated from the data 
measured in humans is shown in Fig. 8 as function of the 
subglottal (oral) pressure. The trend, i.e. the increase of 
MOCR with subglottal pressure, is in good agreement with 
the modeled data presented in Fig. 7 for the higher P_sen values. 


Figure 6. Computed IS versus CQ for various 
prephonatory glottal half-widths. (F0 = 100 Hz). 


Figure 7. Computed MOCR versus P_sen for various 
prephonatory glottal half-widths g. (F0 = 100 Hz). 


Figure 8. Measured MOCR versus subglottal pressure in 
humans - [7]. (Number of the subjects: N = 28). 


The influence of the fundamental frequency on 
MOCR is demonstrated in Fig. 9, where the computed 
modified output cost ratio is shown as function of the 
lung pressure for F0=100 Hz and F0=400 Hz, again for 
all prephonatory half-widths g considered. The tendencies 
of MOCR changes from a maximum at P_th pressure 
through a minimum to another maximum at PIP pressure 
values are similar for both F0 values; however, the values 
of MOCR are higher for the higher fundamental 
frequency F0=400 Hz. 


Figure 9. Computed MOCR versus P_lung for various 
prephonatory glottal half-widths g for F0=100 and 
400 Hz. 


Figure 10. Measured MOCR versus fundamental 
frequency F0 in humans - [7]. 


The modelled influence of the fundamental frequency 
is in good qualitative agreement with the data measured 
in humans as shown in Fig. 10, where the increase of 
MOCR parameter with F0 is obvious. 


IV. CONCLUSIONS 


The present study tested an output-cost ratio parameter 
which is supposed to reflect economy of voice 
production. An aeroelastic model of vocal fold vibration 
and material recorded from female subjects were used. 
Results obtained with modeling corresponded to those 
obtained from humans. 

Based on the results, it appears that the rate of SPL 
rise in relation to P_lung and F0 exceeds the rise in IS. As 
a result, OCR does not correspond to the 
clinical and pedagogical observations suggesting that 
using loud phonation and high pitch (for an excessively 
long time) increases the risk of vocal fatigue and vocal 
fold traumas. A more complicated parameter, taking into 
account the effects of F0 (= increased number of 
collisions in time) and the loading aerodynamic and inertia 
forces caused by the acceleration of the vocal fold tissue, 
might better reflect the mechanical vocal fold loading and 
thus better describe the economy of voice production. 


OPENING AND LARYNX POSITION 


Abstract: The simultaneous assessment of the status 
of the glottis opening and the position of the larynx 
can be beneficial for the diagnosis of disorders of 
voice production and swallowing. The method presented 
here makes use of a time-multiplex algorithm 
for the measurement of space-resolved transfer 
impedances through the larynx. The fast sequence 
of measurements allows a quasi-simultaneous 
assessment of both larynx position and EGG 
signal in 32 channels. First results indicate a high 
potential of the method for use as a non-invasive 
tool in the diagnosis of voice dysfunction, ventricular 
fold phonation and swallowing disorders. 
Keywords: Voice assessment, larynx position, 
tomography, transfer impedance, EGG 


I. INTRODUCTION 


Complex phonatory manoeuvres such as swallowing 
and some singing styles require a synchronous 
adduction/abduction and change of larynx position. In the 
case of dysfunction, the synchronization can be 
disturbed, and a temporary or persistent dislocation of the 
larynx might occur. The simultaneous assessment of 
larynx position and glottis opening requires costly 
imaging devices with high spatial resolution, such as 
sonography, CT or MRT, and at the same time a method with 
sufficiently high temporal resolution, such as EGG [1,2]. 


Non-invasive methods for assessment of dislocation of 
the larynx and glottal dynamics would be beneficial for 
the ambulatory diagnosis of voice, speech and swallowing 
disorders. However, current methods are either 
invasive (CT) and/or require high costs/time (MRT), and 
do not offer the possibility of simultaneous observation 
of the glottis dynamics with EGG (see Fig. 1). 


The EGG device EG2 from Glottal Enterprises allows 
the evaluation of the relative amplitude between two 
channels using four electrodes [3]. This feature is useful 
for positioning of the electrodes but does not seem to 
have been applied to larynx position measurements yet. 


Figure 1: Simultaneous assessment of larynx position 
and EGG signal 


II. METHODS 


This approach presents a device with a time-multiplex 
method for simultaneous acquisition of up to 36 channels 
within one phonation cycle ([4], see Fig. 2). A 
3 MHz carrier signal is generated and fed to a multiplex 
unit that temporally distributes the signal to 6 electrodes 
that are organized in a 2x3 matrix. The same matrix 
form is chosen for the receiving electrodes, which are 
subsequently connected to a de-multiplex unit, demodulator 
and preamplifier. 


The signal generation, synchronization, control and 
evaluation of the transfer paths are performed with 
LabVIEW and a 200 kSample DAQ card. 


Figure 2: Set-up of the 36-channel device for assessment 
of larynx position and EGG signal 


The following properties characterise the set-up: 
• Discrete sine wave signal generator with 
2 MHz carrier; current is fixed with I < 10 mA 
• Galvanic isolation; high-speed CMOS time 
multiplexing 
• Electrodes with 1 cm diameter; 2x6 arrays; optional 
use of contact gel; placement of the electrode 
array with elastic bands at the height of 
the glottis 
• Demodulation of the received signal; the amplitude 
of the signal represents the conductivity; 
synchronous sampling yields the conductivity 
and the EGG signal time series for each 
channel 


Fig. 3 details the scheme for successive acquisition of 
the glottis status using the 36-channel multiplex approach. 
Each path represents an EGG measurement for 
a specific sender-receiver combination. During one 
phonation cycle all 36 paths are sampled in succession 
several times. 
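
The time-multiplex idea can be illustrated by splitting a serial sample stream back into per-path time series. The sketch below assumes a hypothetical stream layout in which the 36 paths are visited in a fixed order, one sample per slot, and it ignores any pause between sweeps; it does not describe the actual hardware or LabVIEW implementation. 

```python
def demultiplex(stream, n_channels=36):
    """Split a time-multiplexed sample stream into one series per channel.

    ASSUMPTION: sample k of the stream belongs to channel k % n_channels.
    An incomplete trailing sweep is discarded.
    """
    usable = len(stream) - len(stream) % n_channels
    return [stream[ch:usable:n_channels] for ch in range(n_channels)]


# Synthetic stream of three complete sweeps; each value encodes sweep and channel.
stream = [sweep * 100 + ch for sweep in range(3) for ch in range(36)]
channels = demultiplex(stream)

print(channels[0])   # samples of the first path over the three sweeps
print(channels[35])  # samples of the last path over the three sweeps
```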


Figure 3: Concept of time-multiplex acquisition of the 
glottis status. The voltage measured at each channel 
(slots 1 to 36, repeated) is converted by sample and hold 
circuits into the 36 channel EGG signal. During one 
phonation cycle the 36 paths are sampled several times. 


After each set of 36 channels the acquisition is paused 
for a time tau (see Fig. 4) and then the acquisition is 
continuously repeated until a user interrupt is detected. 


Figure 4: Repeated acquisition of the 36 channels. 


At the time of acquisition of channel one, a hardware 
handshake signal is generated by the quartz-synchronized 
microcontroller switching unit and fed to the 
DAQ card, allowing a very precise timing control of the 
signal processing. 


The results are displayed in real-time on a LabVIEW 
Virtual Instrument. Alongside the conductivity 
of all 36 channels, an indicator is shown which represents 
the overall quality of the contact between electrodes 
and skin. The position of the path(s) with highest 
conductivity is indicated as a moving ball on a 
two-dimensional plot. 


In the software, a simple algorithm is used to calculate 
vertical and dorsal-ventral movement of the larynx from 
a comparison of impedance amplitudes in opposite 
paths. 


A major problem is the need for fast switching between 
the electrodes to allow an accurate representation of the 
EGG curves. Since the electrodes and their leads have 
non-negligible capacitances at 3 MHz, some effort must 
be made to reduce the transition times in order to obtain 
satisfactory results at higher multiplex rates. For this 
reason the switching has been implemented entirely 
in hardware using CMOS switches with dedicated 
switching properties. 


III. RESULTS 
A. Performance evaluation of the method 


The time-multiplex approach allows a very effective 
separation of the different measurement channels. The 
interval between two time slots could be reduced 
to 23 µs, corresponding to a sample rate of 
44,100 Hz. The maximum sample duration within such 
a time slot was 14 µs. 


With 36 channels a glottal cycle could be sampled at 
1,225 Hz in each channel, which should be sufficient for 
low-pitched voices. For the EGG analysis of higher 
pitches, the number of channels could be reduced. 
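
The trade-off between the number of channels and the temporal resolution follows from simple arithmetic, sketched below. The function names and the example fundamental frequencies are hypothetical; the slot rate of 44,100 Hz and the 36 channels are the figures quoted above. 

```python
def per_channel_rate(slot_rate_hz, n_channels):
    """Sample rate available to each channel when the slots are shared equally."""
    return slot_rate_hz / n_channels


def samples_per_cycle(slot_rate_hz, n_channels, f0_hz):
    """Number of samples each channel receives during one glottal cycle."""
    return per_channel_rate(slot_rate_hz, n_channels) / f0_hz


SLOT_RATE_HZ = 44100

slot_interval_us = 1e6 / SLOT_RATE_HZ      # about 22.7 µs between time slots
rate_36 = per_channel_rate(SLOT_RATE_HZ, 36)

print(f"slot interval: {slot_interval_us:.1f} µs")
print(f"per-channel rate with 36 channels: {rate_36:.0f} Hz")
for f0 in (100, 200, 400):                 # illustrative fundamental frequencies
    n36 = samples_per_cycle(SLOT_RATE_HZ, 36, f0)
    n12 = samples_per_cycle(SLOT_RATE_HZ, 12, f0)
    print(f"F0 = {f0} Hz: {n36:.2f} samples/cycle (36 ch), {n12:.2f} (12 ch)")
```

At F0 = 100 Hz each of the 36 channels receives about 12 samples per glottal cycle, which shows why fewer channels are preferable for higher pitches. 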


The function of the position measurement was tested by 
a trained speaker who, in sequence, closed the glottis 
and the intra-oral space using his tongue. As a result, 
the ball on the 2D-plot, indicating the location of highest 
conductivity, jumped to the bottom or the top of the 
plot according to the closure of glottis or mouth cavity. 


B. Application to singing and swallowing 


The 36 channel device has been tested with healthy subjects 
performing swallowing manoeuvres. The result 
from the position detection algorithm was visualised as 
a trajectory in the 2D-space. For successive swallowing 
tasks of the same subject, similar patterns of the trajectory 
were observed. 


In earlier studies [5], a two-channel device (EG2) has 
been successfully applied to simultaneous EGG and 
larynx position measurements of healthy subjects performing 
phonatory manoeuvres such as sweep singing 
or swallowing. The results from a sweep analysis are 
shown in Fig. 5. 


Figure 5: Analysis of a sweep signal, sung by a healthy 
male subject with a register break between modal and 
head register. Top: sound pressure; Centre: EGG signal; 
Bottom: height information. 


In the EGG signal the transition from modal to head 
register is seen from the reduced amplitude between 
about 2.8 and 5.5 s. In the height signal, a rise of 
the larynx position seems to coincide with the register 
transition. 


An evaluation of the complete spatial information from 
the 36 paths of the new approach will reveal a more 
detailed view of the two-dimensional movement of the 
larynx. We hope these results will be available at the 
time of the conference. 


IV. DISCUSSION 


The non-invasive assessment of spatial information 
from the 36 channel measurements has the potential to 
accurately indicate changes in the position of the larynx. 
However, some problems must be solved before the 
device can be used for medical applications. 


The normalization of the spatial information requires a
calibration routine which allows the evaluation of
individual reference positions for different subjects. We
are currently investigating the application of optical
methods, which offer satisfactory repeatability and accuracy. 


Preliminary results indicate that the spatial distribution
of more complex manoeuvres such as swallowing
extends over a range which is larger than the actual area
covered by the 12 electrodes. Possibly even more
electrodes would be beneficial for an improved resolution.
The realization of such an even more complex set-up
would probably require a different concept. 

Future implementations of the soft- and hardware
concept should include PC-based control of the timing
and switching scheme. 


V. CONCLUSION 


The multi-channel EGG system has a number of
advantages compared to one- or two-channel EGG devices.
It allows the simultaneous measurement of the EGG signal
and the evaluation of the larynx position in real time. 


The calibration of the position measurement is a topic
of current work, and several options are being evaluated,
including numerical methods and measurements on
models and humans. Investigations of the accuracy and
spatial resolution of the method in clinical studies are
also planned. 


Future applications of the method include the
experimental study of complex phonation processes such as
combined vocal-ventricular fold phonation. The
method has the potential to perform an analysis of
concurrent oscillatory patterns with high resolution, both in
time and space. 


Abstract - A multi-purpose software tool (BioVoice) is
presented, capable of automatically analysing a wide
range of voice signals, with no manual settings
required from the user. This makes the tool suitable for
use by non-expert users in several fields, ranging
from high-pitched newborn cries, to healthy adult
singing voices, to irregular, pathological voice
signals. The main voice characteristics (fundamental
frequency and formants) are estimated and tracked by
means of robust analysis techniques that can handle the
above-mentioned wide range of signals, as internal
settings for optimal frame length, frequency range of
analysis and plots are automatically adjusted. 

Specific parameters are evaluated according to the kind 
of signal under study, and displayed with suitable plots 
and tables. 

In this paper, the method is applied to patients affected
by laryngeal hemiplegia who underwent lipofilling
treatment to recover phonatory capabilities. 

Keywords: multi-purpose voice analysis tool, robust 
parameter estimation, laryngeal hemiplegia. 


I. INTRODUCTION 


Voice analysis is of great relevance in several fields, ranging
from newborn infant cry to singing voice and to hoarse adult
voices. Hence paediatricians and surgeons, but also singing
teachers, psychologists and logopaedists, are involved in
this field of research. Several analysis techniques
and reference values have been proposed in the literature and
are in use. A huge number of indexes are available, some of
which are implemented in free or commercially available
software tools [1], [2]. However, users often resort to a
small subset of such indexes, owing to difficulties in
understanding subtle differences among parameters and in
dealing with rather technical options, especially concerning
spectral analysis. Moreover, commercial software often
suffers from limitations linked to the implemented analysis
techniques, which sometimes prevent the analysis of
high-pitched and/or highly degraded voices. 

The BioVoice tool proposed here aims at providing a few
objective parameters and plots that are easily understood and
managed by a wide range of users. The proposed
software tool performs single or comparative analysis of the
main voice characteristics (fundamental frequency and
formants) by means of robust analysis tools, specifically
designed to deal with a wide range of pitch values and
possibly highly degraded signals. At present, three main
categories are considered in BioVoice: newborn infant
cry, singing voice and adult hoarse voice. 


II. METHOD 


Basic voice characteristics (fundamental frequency (F0) and
formants) are evaluated and tracked by means of robust
analysis techniques that can handle the mentioned wide
range of signals. To this aim, automatic adjustment of the
internal settings for optimal frame length, frequency range
of analysis and plots is implemented. 

First, the signal is divided into short frames, whose length
varies adaptively with the signal characteristics:
the higher the F0, the shorter the frame (kept fixed at 3
pitch periods). A voiced/unvoiced (V/UV) separation
algorithm is implemented, to avoid F0 estimation on signal
frames that have no harmonic content and could give
misleading results. 
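
As a rough illustration of the adaptive framing rule above (three pitch
periods per frame), the frame length in samples follows directly from the
sampling frequency and the current F0 estimate. The function name and the
rounding to the nearest sample are assumptions made for this sketch; they
are not taken from BioVoice.

```python
def frame_length_samples(fs_hz, f0_hz, periods=3):
    """Frame length in samples covering `periods` pitch periods.

    Hypothetical helper: rounds to the nearest whole sample.
    """
    if f0_hz <= 0:
        raise ValueError("F0 must be positive (voiced frames only)")
    return int(round(periods * fs_hz / f0_hz))


# A 100 Hz adult voice needs a much longer frame than a 450 Hz cry.
adult = frame_length_samples(44100, 100)    # 1323 samples (30 ms)
newborn = frame_length_samples(44100, 450)  # 294 samples (about 6.7 ms)
```

With a fixed frame, a high-pitched cry would span many pitch periods and
smear rapid F0 changes, which is why the frame shrinks as F0 rises.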

F0 tracking is achieved by means of a two-step procedure,
based on well-established results: the AMDF approach is
applied to a wavelet-smoothed SIFT estimate of F0, with
optimised and varying adaptive filter order [3], [4]. 

Robust and high-resolution formant (resonance frequency)
estimation is implemented, based on parametric
AutoRegressive (AR) PSD evaluation. The AR model order
p is automatically selected by the program according to
patient and signal characteristics, based on the relationship
$p = 2LF_s/c$, where $F_s$ = sampling frequency, L = vocal tract
length (linked to the patient's age and sex), and c = sound speed
[4]. Colour-coded spectrograms are also provided, with the
tracking of the formants $F_i$ superimposed, whose number and
frequency range depend on the signal under study. Mean
values and standard deviations (std) are also displayed. 
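
The order-selection rule can be checked with a few lines of arithmetic.
In the sketch below, the vocal tract length of 0.17 m and the sound speed
of 340 m/s are typical textbook values assumed for illustration only, and
the function name and rounding are likewise hypothetical; the paper does
not state which values BioVoice uses.

```python
def ar_model_order(fs_hz, tract_length_m, sound_speed_m_s=340.0):
    """AR order from p = 2 * L * Fs / c, rounded to the nearest integer."""
    return int(round(2.0 * tract_length_m * fs_hz / sound_speed_m_s))


# Assumed adult vocal tract of 0.17 m sampled at 44.1 kHz:
# 2 * 0.17 * 44100 / 340 = 44.1, so the order is 44.
p_adult = ar_model_order(44100, 0.17)
```

A shorter vocal tract (for instance a newborn's) gives a lower order at the
same sampling frequency, consistent with fewer resonances in the band.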

Other ad hoc parameters are added to these basic features, 
for each category. They are summarised here. 

Newborn infant cry - Newborn infant cry is characterised by
a high fundamental frequency F0 (>300 Hz), possibly with
abrupt changes and voiced/unvoiced features of very short
duration within a single utterance. The frequency range is
thus set up to 10 kHz. F0, V/UV frames and the spectrogram
with the first 3 resonance frequencies superimposed are
plotted, all in colour. Tables summarise mean, std, max and
min values for F0 and F1-F3, as well as cry length and the
corresponding maximum energy. These parameters are in
fact considered among the most meaningful in newborn cry
analysis (see [5] and references therein). 

Singing voice - Singing voice results from complex,
voluntary movements of the larynx and of the vocal tract
articulators, and is characterised by possibly high-pitched,
rapidly time-varying signals. As we deal with adult singers,
the frequency range is set up to 6 kHz. F0, vibrato rate,
vibrato extent, vocal intonation, the spectrogram with the
first 5 formants and the PSD are plotted, along with the
co-ordinates of the formant maxima. These parameters are
important for singers, being strictly related to correct vocal
emission and hence to the singer's performance (see [6] and
references therein). 

Adult hoarse voices - Among the huge number of available
parameters for quantifying F0 irregularities, Jitter (J) and
Relative Average Perturbation (RAP) were recognised by
physicians as relevant in most applications and are
implemented here. The mean and standard deviation (std) of
J and RAP over the whole signal are also evaluated and displayed. 
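
The paper does not give the exact formulas used for J and RAP. As a
hedged illustration, the sketch below uses the commonly quoted
definitions computed from a sequence of pitch periods: local jitter as
the mean absolute difference between consecutive periods, and RAP as the
mean absolute deviation of each period from its three-point average,
both relative to the mean period. Function names are hypothetical.

```python
import numpy as np


def jitter_percent(periods):
    """Local jitter: mean |T[i+1] - T[i]| relative to the mean period, in %."""
    t = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(t))) / np.mean(t)


def rap_percent(periods):
    """RAP: mean deviation of each period from its 3-point average, in %."""
    t = np.asarray(periods, dtype=float)
    smoothed = (t[:-2] + t[1:-1] + t[2:]) / 3.0
    return 100.0 * np.mean(np.abs(t[1:-1] - smoothed)) / np.mean(t)


periods_ms = [4.0, 6.0, 4.0, 6.0]   # strongly alternating cycle lengths
j = jitter_percent(periods_ms)      # 40 %
r = rap_percent(periods_ms)         # about 26.7 %
```

Because RAP compares each period with a local average, it is less
sensitive than jitter to slow, voluntary pitch drift.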

An adaptive noise estimation technique is implemented that
allows tracking of the varying noise level during phonation.
For pathological voices, spectral noise is in fact closely
related to the degree of perceived hoarseness. Within BioVoice,
noise variations are tracked by means of an adaptive version
of the Normalised Noise Energy method, named ANNE
(Adaptive Normalised Noise Energy) [7], [8]. It relies on a
comb filtering approach, optimised to deal with data
windows of varying length. Large negative ANNE values
correspond to good voice quality, while values close to zero
reflect the presence of strong noise. Spectrograms and PSD
plots complete the set of pictures, allowing visual inspection
of possible harmonic energy recovery. On the PSD plot,
PSD0, PSDlow and PSDhigh are reported, which quantify the
global, low-frequency and high-frequency energy of the signal,
respectively. The SNR is also provided. These indexes could
further help the clinician in assessing voice quality
recovery. 


III. THE INTERFACE 


A user-friendly interface (Fig. 1) allows selecting age, sex 
and type of vocal emission for each patient, performing 
computations without any other requirement. The software 
tool automatically adjusts internal settings for optimal frame 
length, frequency range of analysis and plots. Specifically, 
the interface allows for: 

- selecting data (.wav files);

- choosing the voice type, ranging from high-pitched newborn
cries and possibly singers' voices to adult voices: the
overall allowed F0 range is 40 Hz < F0 < 1300 Hz;

- selecting the kind of analysis: single audio file or two files
(for comparison). 

A notice is added concerning the computing time required: for
long files (> 5 s) and high sampling frequencies (> 40 kHz) the
total time could approach 5 min. A progress bar
shows the remaining time during computations. 

Plots and tables are displayed and saved in printable format,
for a visual comparison of results, all in colour. 

Figure 1 - BioVoice analysis tool: user interface 


IV. EXPERIMENTAL RESULTS 


BioVoice is applied here to nine patients (aged 18-74 years,
mean 48) with breathy dysphonia, secondary to laryngeal
hemiplegia or anatomical defects, who underwent vocal fold
lipoinjection. Lipostructure is a valuable technique for voice
rehabilitation in glottic incompetence. Patients underwent
pre- and post-treatment videolaryngostroboscopy, maximum
phonation time (MPT) measurements, GRBAS perceptual
evaluations, and Voice Handicap Index (VHI)
self-assessments. Voice quality improved soon after surgery and
remained stable over 3-26 months, as confirmed by GRBAS,
MPT and VHI [9]. 

To show BioVoice features, one example is presented here,
concerning a female patient. Before lipofilling, the GRBAS
scores were [3 3 2 2 0], denoting a high level of
dysphonia, with full recovery after the treatment (all
GRBAS scores = 0). Owing to printing requirements, figures are
reported in grey scale (pre = light grey, post = black). 

Fig. 2 shows pre- and post-treatment F0 tracking, along with
its mean and std values. As pre- and post-treatment
(PRT-POT) audio signals are usually of different lengths, the tool
scales the plots to the longer one. In this case, the PRT signal
has a length of about 1.6 s, while the POT one lasts about 3.6 s.
Notice the long unvoiced period (above 2 s) found by the
program for POT. 

Figure 2 - Pre- and post-surgical F0 tracking, mean and std
values. 


Good recovery is shown, with a stable POT F0 around
207 Hz, as compared to the highly varying PRT F0, which could
be evaluated for less than 1 s. Fig. 3 reports Jitter, RAP and NNE
tracking (with mean and std values) for both PRT and POT
signals. The figure clearly shows that lipofilling
greatly enhances voice quality under all these parameters.
Again, notice the unvoiced regions, where parameters could
not be computed. 

Fig. 4 shows PRT and POT spectrograms with formant
tracking superimposed (black dots), along with mean and
std values: after treatment, harmonics and formants are
almost recovered and show a more regular behaviour and a
higher energy level (dark black) than the PRT ones.
To quantify such results, the PSD plot is displayed in Fig. 5,
where the almost unvoiced and noisy frequency
content of the PRT signal is evident (light grey line). By
contrast, the POT PSD is characterised by a rather
well-structured, high-energy harmonic shape in the frequency
range typical of voiced emission in adults (<2500 Hz), and a
low-energy one above this range, mainly related to noise
(black line). 

Good recovery was found for almost all cases, and results
were in agreement with GRBAS scores. Owing to the
small number of available cases, statistical tests to assess
reliability were not applied. 


V. FINAL REMARKS 


A new tool for voice analysis has been developed, based on
robust adaptive techniques, capable of dealing with a wide
range of voice sounds. It is provided with a user-friendly
interface that requires only a few basic choices from the
user. The method has already been successfully applied to
pathological voices, to compare pre- and post-surgical voice
quality in cases of thyroplastic medialisation and cyst/nodule
exeresis [10], [11]. 

Figure 3 - Jitter, RAP and NNE tracking. 


As far as the reliability of results is concerned, the method has
been compared to one of the most widely used commercial
software tools, i.e. the MultiDimensional Voice Program (MDVP®,
KayPentax Corp.), where NHR has been considered instead
of NNE [11]. First results have shown that BioVoice
performs more reliable analysis than MDVP. This could be
due to the more robust F0 estimation in BioVoice, and to the
different analysis windows used: fixed for MDVP and
adaptively tailored to the varying F0 for BioVoice. 

The tool was developed under Matlab 7.3 and requires a few
minutes to perform a complete pre-post analysis. If properly
optimised and implemented in a C++ environment, it
could perform computations in almost real time. 

Further work will concern finding stricter correlations
between objective and perceptual indexes, as well as
exploring and adding new, possibly helpful indexes and
plots. When properly optimised, the tool could be
implemented on a mobile device, as an aid for clinicians,
logopaedists and patients, also for rehabilitation purposes,
after surgery or medical treatment. 

Figure 4 - Pre- (upper) and post-surgical (lower)
spectrogram and formant tracking. Mean and std values are
displayed. 


IMPROVEMENT OF SOURCE-TRACT DECOMPOSITION OF SPEECH 
USING ANALOGY WITH LF MODEL FOR GLOTTAL SOURCE AND TUBE 
MODEL FOR VOCAL TRACT 


T. Dubuisson¹, T. Dutoit¹ 
¹ Circuit Theory and Signal Processing Lab (TCTS Lab), Faculté Polytechnique de Mons, Belgium 


Abstract: In this paper we propose improvements to a 
recent algorithm of speech decomposition into glottal 
source and vocal tract contributions. This algorithm is 
based on the Zeros of the Z-Transform (ZZT) 
representation and requires restrictive conditions 
about the analysis window. Inaccurate results of 
decomposition can occur if these conditions are not 
fulfilled. The improvement method consists in 
considering an analogy with the LF model for the 
glottal source and a tube model for the vocal tract. 
Results are presented for a sustained vowel /a/ in both 
time and spectral domain. Future developments are 
also proposed. 

Keywords: Zeros of the Z-Transform, glottal source, 
vocal tract, speech decomposition, Glottal Closure 
Instant 


I. INTRODUCTION 


Analysis of the glottal source has been investigated by
researchers because it has applications in different fields,
such as speech recognition or voice quality modification.
Among the glottal source estimation techniques described in
the literature, some apply the inverse filtering method
iteratively [1] in order to remove the vocal tract contribution
from speech, while others apply LP analysis only during
the closed phase of the glottal source [2] in order to
minimize its effect on vocal tract estimation. Another
method uses the ARX model [3] in order to jointly
estimate glottal source model and vocal tract model
parameters. Finally, some methods focus on the
estimation of glottal source parameters such as the Open
Quotient [4] or Glottal Closure Instants (GCIs) [5, 6]. 

Recently another technique of decomposition of speech 
into glottal source and vocal tract contributions was 
proposed. This technique uses the ZZT representation of 
speech [7] and is particularly sensitive to GCIs 
localization. Applied to real speech signals, errors on the 
estimation of these instants can sometimes lead to noisy 
decomposition results. That is why improvements of this 
decomposition are presented here. 

This paper is organized as follows. In section II the 
ZZT representation is defined, the ZZT-based 
decomposition and its improvements are described. In 
section III the results of improvements are presented for a 
sustained vowel /a/ and compared with results obtained 
without correction. In section IV the results are discussed 
and the perspectives are presented. 


II. METHODS 
A. Database 


Test signals were recorded (16 kHz, 16 bits) at TCTS
Lab and are real sustained vowels /a/, /e/, /o/ and real
transitions between these vowels. 


B. ZZT representation and decomposition algorithm 


For an N-sample signal x(n), the ZZT representation
[7] is defined as the set of roots $Z_m$ of the z-transform X(z)
of the signal x(n): 


$$X(z) = \sum_{n=0}^{N-1} x(n)\, z^{-n} = x(0)\, z^{-N+1} \prod_{m=1}^{N-1} (z - Z_m) \qquad (1)$$

In order to decompose speech into glottal source
(glottal flow derivative) and vocal tract impulse response
[7], the ZZT are computed on frames centered on each GCI
(computed by the algorithm described in [5]) and whose
length is twice the fundamental period at the considered
GCI. The glottal source spectrum is then computed from the
zeros with modulus greater than 1 (maximum-phase
components) and the vocal tract spectrum from the zeros
with modulus lower than 1 (minimum-phase components). 
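
The splitting step can be illustrated with NumPy on a toy frame. This is
only a minimal sketch of the idea (polynomial roots, then separation by
modulus); the function names are hypothetical, windowing and GCI
centering are omitted, and root finding on long real frames may need
more numerical care than shown here.

```python
import numpy as np


def zzt(frame):
    """Roots Z_m of X(z) for one frame (x(0) must be non-zero)."""
    # X(z) = z^-(N-1) * (x(0) z^(N-1) + ... + x(N-1)), so the frame samples
    # are the polynomial coefficients in descending powers of z.
    return np.roots(np.asarray(frame, dtype=float))


def split_by_unit_circle(zeros):
    """Return (zeros outside, zeros inside) the unit circle."""
    modulus = np.abs(zeros)
    return zeros[modulus > 1.0], zeros[modulus < 1.0]


def magnitude_spectrum(zeros, n_fft=512):
    """Magnitude spectrum of the monic polynomial with the given zeros."""
    coeffs = np.atleast_1d(np.poly(zeros))
    return np.abs(np.fft.fft(coeffs, n_fft))[: n_fft // 2 + 1]


# Toy frame with one zero at z = 2 and one at z = 0.5: (z - 2)(z - 0.5)
frame = [1.0, -2.5, 1.0]
outside, inside = split_by_unit_circle(zzt(frame))
source_spectrum = magnitude_spectrum(outside)  # maximum-phase part
tract_spectrum = magnitude_spectrum(inside)    # minimum-phase part
```

In this sketch, zeros lying exactly on the unit circle are assigned to
neither component.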


C. Improvement of the decomposition 


Due to errors in the estimation of GCIs,
decomposition results can sometimes be noisy, and thus
not suitable for accurate analysis of the glottal source.
Experiments showed that, if the ZZT-based decomposition is
computed for several frames whose center is shifted by
a few samples around a GCI, better results can be obtained
for an instant close but not identical to this GCI. For a
range of shifts around the GCIs in a voiced island of
speech, the method considers the vocal tract candidate
(VTC) and the glottal source candidate (GSC) obtained
from the ZZT-based decomposition in order to determine
which shift provides the best results for each GCI. 

Concerning the glottal source, an analogy is made
with the LF model [8]. Indeed, considering the GSCs
obtained for shifts around a given GCI, inaccurate
decompositions are mainly characterized by a lot of
energy located at frequencies higher than 2 kHz, contrary
to the LF model, in which energy is mainly located below
2 kHz. Each GSC is therefore characterized by the energy
ratio between the 0-2000 Hz band and the whole
spectrum: 

$$\text{Feature}_{GSC} = \frac{\text{Energy}\,[0\text{-}2000\ \text{Hz}]}{\text{Energy}\,[0\text{-}8000\ \text{Hz}]} \qquad (2)$$
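
A minimal sketch of this feature is given below, assuming the energies
are obtained by summing the squared FFT magnitude over the relevant
bins; the paper does not say how the band energies are computed, and the
function name is hypothetical. At the 16 kHz sampling rate of the test
signals the whole spectrum spans 0-8000 Hz.

```python
import numpy as np


def low_band_energy_ratio(signal, fs_hz, cutoff_hz=2000.0):
    """Energy in [0, cutoff] divided by energy in [0, fs/2]."""
    power = np.abs(np.fft.rfft(np.asarray(signal, dtype=float))) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs_hz)
    return power[freqs <= cutoff_hz].sum() / power.sum()


fs = 16000
n = np.arange(1600)                        # 100 ms, 10 Hz bin spacing
clean = np.sin(2 * np.pi * 500 * n / fs)   # energy well below 2 kHz
noisy = np.sin(2 * np.pi * 4000 * n / fs)  # energy well above 2 kHz
```

The ratio is close to 1 for `clean` and close to 0 for `noisy`, so a
candidate with much high-frequency energy receives a low feature value.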

The vocal tract being a physical system with its own
structure and elasticity, it is assumed that, during the
production of a sustained vowel, its geometry has to be as
continuous as possible. To express
this continuity, the tube model [9] for the vocal tract is
used and the radii of this model are computed by LP
analysis [10] of the vocal tract impulse response (order
set to 18). Each VTC is therefore characterized by a
vector of 19 radii. 

Around each GCI, the shift corresponding to the best
decomposition must be a compromise between two
criteria: 

• GSC: among all the candidates, the elected one
must be characterized by the highest energy ratio
between the 0-2000 Hz band and the whole spectrum.
The criterion is thus the minimization of the energy at
high frequencies. 
• VTC: during the production of a sustained vowel,
the geometry of the vocal tract cannot vary too much
between two consecutive GCIs. Among all the VTCs,
the elected candidate must be the one for which the
vector of radii is the closest to the ones
corresponding to the candidates for the previous and the
next GCI. The criterion is thus the maximization of
the continuity of the vocal tract geometry. 


A dynamic programming algorithm is therefore
implemented to optimize these criteria over the whole
voiced island of speech (see Fig. 1). 




Fig. 1 Dynamic programming algorithm (1 shift before 
and after each GCI — 3 states) 


In this algorithm each step corresponds to a GCI and 
each state corresponds to a particular shift around this 
GCI. The goal of this algorithm is to find the best path 
among all the shifts by minimizing a cost function on the 
whole voiced island of speech: 


$$\mathrm{Cost}(i,j) = \mathrm{Cost}(i-1,k) + \mathrm{TC}(k,j) + \mathrm{OC}(j) \qquad (3)$$


where i stands for the step index, j for the state index at
step i, k for the state index at step i-1, TC (Transition
Cost) stands for the difference of the radii between the
VTC at state k and the VTC at state j, and OC (Observation
Cost) stands for the inverse of the feature defined for the
GSC at state j. At the end of the voiced island of speech, the
best path is chosen as the one with the lowest cumulated
cost. The position of the GCIs can thus be corrected
according to this choice. 
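
The recursion can be sketched as a Viterbi-style search over shifts.
Several details below are assumptions made for illustration: the
Euclidean norm as the "difference of the radii", the array layout, the
function name, and the explicit minimum over the previous state k. The
paper specifies only the two cost terms and the lowest-cumulated-cost
rule.

```python
import numpy as np


def best_shift_path(radii, features):
    """Lowest-cost sequence of shift indices, one per GCI.

    radii:    array (n_gci, n_states, n_radii), tube radii of each VTC
    features: array (n_gci, n_states), energy ratio of each GSC (> 0)
    """
    radii = np.asarray(radii, dtype=float)
    oc = 1.0 / np.asarray(features, dtype=float)    # observation cost
    n_steps, n_states = oc.shape
    cost = np.empty((n_steps, n_states))
    back = np.zeros((n_steps, n_states), dtype=int)
    cost[0] = oc[0]
    for i in range(1, n_steps):
        # tc[k, j]: distance between VTC k at step i-1 and VTC j at step i
        diff = radii[i - 1][:, None, :] - radii[i][None, :, :]
        tc = np.linalg.norm(diff, axis=-1)
        total = cost[i - 1][:, None] + tc
        back[i] = np.argmin(total, axis=0)
        cost[i] = total[back[i], np.arange(n_states)] + oc[i]
    # Backtrack from the cheapest final state.
    path = [int(np.argmin(cost[-1]))]
    for i in range(n_steps - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1], float(cost[-1].min())


# Three GCIs, two candidate shifts each, one radius per candidate.
features = [[1.0, 0.5], [0.5, 1.0], [1.0, 0.5]]
same_geometry = np.zeros((3, 2, 1))              # switching is free
far_apart = np.tile([[0.0], [10.0]], (3, 1, 1))  # switching is costly
```

With `same_geometry` the path follows the best energy ratio at each GCI;
with `far_apart` the transition cost keeps the path on a single shift.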


III. RESULTS 


As explained in Section II, the dynamic programming 
algorithm determines the best shift around each GCI 
according to constraints defined by the cost function. Fig. 
2 shows the evolution of its decision for the sustained 
vowel /a/ and for 4, 6 and 8 samples of shift before and 
after each GCI (9, 13 and 17 states). 


Fig. 2 Decision of the algorithm (from top to bottom: 4, 6 
and 8 samples of shifts before and after each GCI) 


One may see in this figure that considering 4 samples
of shift is not enough (saturation is visible in the decision
of the algorithm) while computing the decomposition for
8 samples is not necessary (the decision is nearly the
same as for 6 samples). However, we will show in the
next subsection that the results obtained for 4 samples of
shift are accurate enough. A shift of 4 samples before and
after each GCI is therefore considered a good choice,
because the improvement obtained for more samples of
shift does not justify the increased computational cost.
From now on the results are presented for 4 samples of
shift. 


A. Results of improvements in time domain 


Fig. 3 shows the glottal sources obtained with and 
without correction. One may see in this figure that the 
noisy components are corrected and that the accurate 
ones before correction remain unaltered. The vocal tract 
responses are not displayed because the spectral domain 
is more suitable in order to observe the improvement on 
this component. 

Fig. 3 Improvement of the glottal source in time domain 
(a: without correction; b: with correction) 


B. Results of improvements in spectral domain 


In this study a GCI-synchronous spectrogram is 
computed. This representation shows the evolution of the 
normalized spectrum of each GCI-centered period of 
glottal source and vocal tract response. Fig. 4 shows the 
GCI-synchronous spectrogram for the glottal source 
without and with correction. 


Fig. 4 Improvement of the glottal source in spectral 
domain (a: without correction; b: with correction) 


Accurate glottal sources are characterized by a 
resonance in low frequencies (the glottal formant) and 
energy located below 2 kHz while the noisy ones have 
more energy in higher frequencies. After correction the 
noisy glottal sources are closer to the other accurate ones. 

Concerning the vocal tract impulse response, the 
formants detected by Wavesurfer [11] on the speech 
signal are superimposed on the spectrogram in Fig. 5 
(dotted lines). The correlation between the trajectory of 
the formants and the ones detected by Wavesurfer is good 
before correction although there are discontinuities in the 
formant trajectories for some GCIs. These discontinuities 
are less present after correction and the energy bursts in 
high frequencies have disappeared. 


Fig. 5 Improvement of the vocal tract response in spectral 
domain (a: without correction; b: with correction) 


C. Indicators of improvement 


In order to quantify the amount of improvement for 
the two components, indicators are proposed. The glottal 
source indicator is defined as: 


$$100 \times \left( \frac{M_{gf,af} - M_{gf,be}}{M_{gf,be}} + \frac{F_{af} - F_{be}}{F_{be}} \right) \qquad (4)$$


where $M_{gf,af}$ stands for the magnitude of the glottal
formant (spectral resonance detected in the 0-250 Hz
band) after correction, $M_{gf,be}$ stands for this magnitude
before correction, $F_{af}$ stands for the energy ratio of the
glottal source after correction and $F_{be}$ for this ratio before
correction. Fig. 6 shows this indicator for the whole
sustained vowel /a/. 

Fig. 6 Glottal source indicator 

This indicator shows strong peaks (full arrows) at the 
GCIs for which the resonance in low frequency is not 
strong enough before correction and smaller peaks 
(dotted arrows) at those for which the glottal sources have 
resonance in low frequencies before correction and are 
less noisy after correction. 

The vocal tract indicator uses the information from 
Wavesurfer in order to quantify the improvement in an 
objective way. The formant indicator is defined as: 


$$100 \times \frac{M_{af} - M_{be}}{M_{be}} \qquad (5)$$

where $M_{af}$ stands for the magnitude at the formant
frequency in the vocal tract spectrum after correction and
$M_{be}$ stands for this magnitude before correction. The
indicator for the vocal tract is the sum of the formant
indicators for the first two formants detected by
Wavesurfer. Fig. 7 shows this indicator for the whole
sustained vowel /a/. 


Fig. 7 Vocal tract indicator 


This indicator shows strong peaks (full arrows) at the 
GCIs for which the discontinuity in the formant 
trajectories is important before correction and smaller 
peaks (dotted arrows) at those for which the energy in 
high frequencies is more important than for other GCIs 
before correction, but without discontinuities. 


IV. DISCUSSION AND CONCLUSION 


The method presented here is based on the ZZT 
representation. It thus differs from the inverse filtering 
based on LP analysis because the estimated LP filter 
contains both the contributions of glottal source and vocal 
tract. It also differs from the ARX based methods because 
the ZZT-based decomposition is not based on a glottal 
source model but only on phase properties of speech 
signal. 

The purpose of this method is the improvement of the 
decomposition of speech into glottal source and vocal 
tract response using analogy with the LF model for glottal 
source and tube model for the vocal tract. These two 
components are characterized by features used in a 
dynamic programming algorithm in order to better 
determine the position of GCIs in voiced islands of 
speech. Accurate results are obtained for sustained 
vowels. ZZT-based decomposition and its improvement 
can lead to computation of parameters like open quotient 
or asymmetry coefficient [8] and adequacy between LF 
model and glottal sources obtained from real speech 
signals. In case of vocal folds pathology, the observation 
of ZZT-based decomposed sequences could lead to 
propose a new model for the glottal source. 


ACKNOWLEDGEMENTS 


Authors acknowledge support from the Walloon 
Region, Belgium, grant WALEO II ECLIPSE #516009, 
the COST Action #2103 ‘Advanced Voice Function 
Assessment’ and the Interuniversity Attraction Pole 
program VI-4 DYSCO of the Belgian Science Policy. 


REFERENCES 


[1] P. Alku, “An automatic method to estimate the time- 
based parameters of the glottal pulseform,” Proc. of 
ICASSP 1992, IEEE, vol. 2, pp. 29-32, 1992. 

[2] E. Moore and M. Clements, “Algorithm for automatic 
glottal waveform estimation without precise glottal 
closure information,” Proc. ICASSP 04, IEEE, vol. 14, 
pp. 492-501, 2004. 

[3] D. Vincent, O. Rosec, and T. Chonavel, “Estimation 
of the LF glottal source parameters based on ARX 
model,” Proc. Interspeech 2005, ISCA, pp. 333-336, 
2005. 

[4] N. Henrich, B. Doval, and C. d’Alessandro, “Glottal 
open quotient estimation using linear prediction,” Proc. 
MAVEBA 1999, IEEE, pp. 12-17, 1999. 

[5] H. Kawahara, Y. Atake, and P. Zolfaghari, “Auditory
event detection based on a time domain fixed point
analysis,” Proc. ICSLP 2000, ISCA, vol. 4, pp. 669-672,
2000. 

[6] A. Kounoudes, P. Naylor, and M. Brookes, “The 
DYPSA algorithm for estimation of glottal closure 
instants in voiced speech,” Proc. ICASSP 02, IEEE, vol. 
1, pp. 820-857, 2002. 

[7] B. Bozkurt, L. Couvreur, and T. Dutoit, “Chirp group 
delay analysis of speech signals,” Speech. Comm., vol. 
49, issue 3, pp. 159-176, 2007. 

[8] G. Fant, “The LF model revisited. Transformation and
frequency domain analysis,” STL-QPSR, vol. 2-3, pp.
121-156, 1995. 

[9] J. Kelly and C. Lochbaum, “Speech synthesis,” Proc.
of 4th International Congress of Acoustics, pp. 1-4, 1962. 

[10] D.G. Childers, Speech Processing and Synthesis
Toolboxes, John Wiley & Sons, 1999, pp. 95-127. 

[11] Wavesurfer : http://www.speech.kth.se/wavesurfer. 


METHODOLOGY OF FUNDAMENTAL FREQUENCY EXTRACTION 
AND ANALYSIS USING MICROPHONE SPEECH SIGNAL AND VOCAL 
TRACT MODEL 


Z. Ciota 


Department of Microelectronics and Computer Science, Technical University of Lodz, Poland 


Abstract: A model of the vocal tract is presented.
Equivalent parameters of the model have been
calculated according to the dimensions of the natural
tract. The proposed model makes it possible to extract
important parameters of the vocal signal, especially
frequency parameters of glottal waves. The proposed
speech processing system also allows analysis
and synthesis of all phonemes. The system is oriented
towards the verification of speech malfunctions. The
verification process can be improved
by using more sophisticated classifiers, such as neural
networks. 

Keywords: Vocal tract, fundamental frequency, speech 
verification, neural networks 


I. INTRODUCTION 


The influence of fundamental frequency fluctuations on
the final speech sound can be verified using a vocal tract
model. The anatomical structure of the human vocal tract,
as well as the actions of all speech production organs, e.g.
lips, velum, tongue, nostrils, glottis, larynx and the
corresponding individual muscle groups, are very complicated.
Nevertheless, the natural vocal tract is the reference point
for different mathematical models [1, 2]. Our model should
take into account all elements and phenomena
appearing during the speech process. It is then
necessary to define the part of the vocal tract that will be
modeled. In our model the following parts of the anatomical
tract have been included: the input signal source coming
from the larynx, the faucal tract, the mouth tract, the nasal
tract and the radiation impedances of both mouth and nose. 


II. MODELING OF VOCAL TRACT 


For fundamental frequency calculation, two
basic methods are available: the autocorrelation method
and the cepstrum method. The first yields precise results,
but we found that it also produces additional incorrect
glottal frequencies, especially in the ranges below 100 Hz
and above 320 Hz. Additionally, our software indicates
some glottal excitations during breaks between phones
and in silence regions. Therefore, with this method it is
necessary to apply special filters to eliminate all incorrect
frequencies. The other method is based on cepstrum analysis. 


The complex values of the cepstrum C(τ) can be obtained
using the following equation: 


$$C(\tau) = F^{-1}[\log F(x(t))] \qquad (1)$$


where F is the Fourier transform, $F^{-1}$ its inverse, and x(t)
represents the speech signal. 

In this transform the convolution of the glottis excitation and the vocal tract response is first converted to a product by the Fourier transform, and the logarithm then separates the two as a sum. In our method we use cepstrum analysis as the less complex option, especially when the modulus of the cepstrum is computed from the modulus of the Fourier transform. The following values of glottis frequency F0 have been taken into account: F0 minimum, F0 maximum, the range and the average value, including statistical properties [3, 4].
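The cepstral route to F0 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes NumPy, a Hamming window, a synthetic pulse-train test frame, and it uses the inverse FFT of the log modulus (a common convention; because the log modulus of a real signal is even-symmetric, forward and inverse transforms differ only by a scale factor). The 100-320 Hz search band is taken from the frequency limits mentioned above.

```python
import numpy as np

def cepstrum_f0(frame, fs, f_min=100.0, f_max=320.0):
    """Estimate F0 (Hz) of one voiced frame from the peak of its real cepstrum."""
    windowed = frame * np.hamming(len(frame))
    # Modulus of the Fourier transform; the small constant avoids log(0).
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)
    # Only quefrencies that correspond to f_max .. f_min are searched.
    q_lo = int(fs / f_max)
    q_hi = int(fs / f_min)
    peak = q_lo + int(np.argmax(cepstrum[q_lo:q_hi + 1]))
    return fs / peak

# Synthetic test frame (assumption): 160 Hz pulse train shaped by a decaying response.
fs = 16000
period = fs // 160
excitation = np.zeros(2048)
excitation[::period] = 1.0
response = np.exp(-np.arange(40) / 8.0)
frame = np.convolve(excitation, response)[:2048]

f0 = cepstrum_f0(frame, fs)
print(f"estimated F0: {f0:.1f} Hz")
```

Restricting the quefrency search band plays the same role as the special filters required by the autocorrelation method: candidates outside the plausible pitch range are never considered.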

One possible method is to replace the anatomical tracts by coaxially connected cylindrical tube sections. Each section has to fit the dimensions of the natural vocal tract, e.g. cross-sectional area and section length, as closely as possible. Such a vocal tract model should take into account a faucal tract which forks into a nasal tract and a mouth tract. It is also important to maintain the dimensions of the natural tract: length of the faucal tract (80 mm), length of the mouth tract (80-100 mm) and length of the nasal tract (120 mm). Unfortunately, cross-sectional areas cannot be unequivocally determined, because they differ from person to person. The complexity of the model depends on the number of tube sections.

The behavior of each model section can be analyzed as a relation between the input acoustic pressure $p_{in}$ and volume velocity $V_{in}$ and the corresponding output quantities $p_{out}$ and $V_{out}$ of the current section. Moreover, acoustic pressure and volume velocity correspond to the electrical quantities voltage and current, respectively. In the next step we can replace each tube by an electrical equivalent circuit. All parameters can be calculated from the geometrical dimensions of the tube: inductance $L$ as an equivalent of the acoustic mass of air in the tube; capacitance $C$ as an equivalent of the acoustic compliance of air; series resistance as an equivalent of the loss caused by viscous friction near the tube walls; additional negative capacitance $C_w$ as an equivalent of the inverse acoustic mass of the vibrating tube walls; conductance $G$ as an equivalent of the acoustic loss due to thermal conduction near the tube walls; additional conductance $G_w$ as an equivalent of the acoustic loss conductance of the vibrating tube walls; angular frequency $\omega$.
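As a rough illustration of how the two lossless elements follow from tube geometry, the sketch below uses the textbook relations for acoustic mass, rho*l/A, and acoustic compliance, A*l/(rho*c^2). The air constants, the 3 cm^2 cross-section and the split into 8 sections are assumptions chosen for the example; the loss and wall elements listed above are omitted.

```python
import math

RHO = 1.2        # air density in kg/m^3 (assumed)
C_SOUND = 343.0  # speed of sound in m/s (assumed)

def tube_section_lc(area_m2, length_m):
    """Lossless acoustic inductance and capacitance of one cylindrical section."""
    inductance = RHO * length_m / area_m2                     # acoustic mass
    capacitance = area_m2 * length_m / (RHO * C_SOUND ** 2)   # acoustic compliance
    return inductance, capacitance

# An 80 mm tract split into 8 equal sections with an assumed area of 3 cm^2.
TRACT_LENGTH = 0.08
sections = [tube_section_lc(3.0e-4, TRACT_LENGTH / 8) for _ in range(8)]
total_inductance = sum(ind for ind, _ in sections)
total_capacitance = sum(cap for _, cap in sections)

# Consistency check: sqrt(L * C) of the whole ladder equals the travel time l / c.
delay = math.sqrt(total_inductance * total_capacitance)
print(delay, TRACT_LENGTH / C_SOUND)
```

The check at the end is independent of the chosen area, which is convenient because, as noted above, cross-sectional areas differ from person to person.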


III. RESULTS


According to the above assumptions, we can calculate equivalent parameters for all sections of the model. However, the calculations are complicated and time-consuming. To avoid these problems the model can be simplified. One can assume that the acoustic wave in the channel is a two-dimensional plane wave. Using such a model it is possible to obtain the transmittance of the vocal tract. As a source of voiceless signals we propose a nozzle model, because the time-domain characteristic of voiceless phonemes is random, and the signal can vary for the same phoneme and for the same person. Using this model we can represent different features of noise phonemes. It is possible to calculate the centre frequency of the noise as well as the total acoustic power of the channel. It is also possible to obtain the frequency characteristic.

Knowledge of the vocal tract parameters makes it possible to examine features and malfunctions of the speech process. It is also possible to examine which part of the vocal tract is responsible for distortion of the speech signal, including the influence of fundamental frequency jitter on speech quality. Examples of simulations for the vowels "J" and "A" with the same excitation (fundamental frequency) are presented in Fig. 1 and Fig. 2.

The process of recognizing such a parameter vector consists of two main parts: teaching and the recognition proper, as shown in Fig. 3. During the teaching process the base of parameters is created. Comparing the current voice with the stored base gives the answer concerning the emotional state or the identity of the speaker of the examined utterance. The comparison process and the final decision are based on the following standard classifiers: nearest mean and nearest neighbour. The decision process can be optimized using different distances and parameter weights. This part of the method is very important and still open. In particular, in the presence of low-quality teaching material, it would be necessary to apply probabilistic methods and multilayer neural perceptrons [1].
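The two standard classifiers named above can be sketched as follows. This is only an illustration under stated assumptions: NumPy, a weighted Euclidean distance, and a tiny stored base of two-element feature vectors (F0 mean, F0 range) whose values are invented for the example and are not measurements from this work.

```python
import numpy as np

def weighted_distances(vectors, query, weights):
    """Weighted Euclidean distance from `query` to every row of `vectors`."""
    return np.sqrt(np.sum(weights * (vectors - query) ** 2, axis=1))

def nearest_mean(train_x, train_y, query, weights):
    """Return the class whose mean feature vector is closest to the query."""
    classes = np.unique(train_y)
    means = np.array([train_x[train_y == c].mean(axis=0) for c in classes])
    return classes[np.argmin(weighted_distances(means, query, weights))]

def nearest_neighbour(train_x, train_y, query, weights):
    """Return the class of the single closest stored vector."""
    return train_y[np.argmin(weighted_distances(train_x, query, weights))]

# Invented stored base: columns are (F0 mean, F0 range) in Hz.
train_x = np.array([[200.0, 120.0], [205.0, 130.0], [207.0, 75.0], [210.0, 80.0]])
train_y = np.array(["anger", "anger", "joy", "joy"])
weights = np.array([1.0, 1.0])   # parameter weights, to be tuned
query = np.array([204.0, 118.0])

by_mean = nearest_mean(train_x, train_y, query, weights)
by_neighbour = nearest_neighbour(train_x, train_y, query, weights)
print(by_mean, by_neighbour)
```

Changing `weights` or swapping the distance function is where the optimization of "different distances and parameter weights" would take place.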


[Block diagram: speech signal; extraction of glottis parameters; training; resulting templates; likelihood comparison; results]

Fig. 3. System for speech quality verification


The simplification of the speech process in computer systems gives rise to a lot of redundancy, so a fuzzy logic approach to speech prediction seems to be a promising solution. On the other hand, a fuzzy system requires pre-defined membership functions and a linguistic model, based on expert knowledge, at the beginning of the design process. Unfortunately, the system should work correctly with different users and very often with different languages, so there is very limited possibility of adapting our speech processor to strongly varying conditions. A better solution can be obtained using an artificial neural network (ANN). Such a system adds two important features: a learning function and adaptability. The linear predictor can be realized using a multi-layer feedforward ANN.

For high learning efficiency, the standard backpropagation method has been extended by adding a momentum term. The basic idea of a multi-layer feedforward network is presented in Fig. 4. In the case of recognition two important problems have to be taken into account. The first is achieving the highest system performance in terms of the certainty of recognition and verification, and the second is the total cost of the system.


[Network diagram: input layer, hidden layers, output layer]

Fig. 4. Multi-layer feedforward neural network


[Learning curves for N (hidden layer) = 40 and N (hidden layer) = 60]

Fig. 5. Examples of the learning process for different numbers of neurons in the hidden layer


As a preliminary example, we use a feedforward neural network with one hidden layer for the identification process. The input layer contains 64 neurons, the hidden layer 40, and the output layer 7 neurons. The input of the ANN is connected to the feature vector of the current speaker and the output states should indicate the result of identification. The learning process was based on the stored features. All neurons are described by the same sigmoid activation function, with a controllable slope parameter. As the learning algorithm we applied the standard backpropagation method with an added momentum term, which helps to avoid local minima.
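A network of the size described above (64 inputs, 40 hidden neurons, 7 outputs, sigmoid with a slope parameter, backpropagation with momentum) can be sketched in a few lines of NumPy. Everything beyond those sizes is an assumption made for illustration: the squared-error loss, batch updates, the learning rate and momentum values, the weight initialisation, and the synthetic feature vectors standing in for stored speaker features.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x, slope):
    """Sigmoid activation with a controllable slope parameter."""
    return 1.0 / (1.0 + np.exp(-slope * x))

class FeedforwardNet:
    def __init__(self, n_in=64, n_hidden=40, n_out=7, slope=1.0):
        self.slope = slope
        self.params = [
            rng.normal(0.0, 0.1, (n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(0.0, 0.1, (n_hidden, n_out)), np.zeros(n_out),
        ]
        self.velocity = [np.zeros_like(p) for p in self.params]

    def forward(self, x):
        w1, b1, w2, b2 = self.params
        hidden = sigmoid(x @ w1 + b1, self.slope)
        return hidden, sigmoid(hidden @ w2 + b2, self.slope)

    def train_step(self, x, target, lr=0.2, momentum=0.9):
        """One batch backpropagation step; returns the loss before the update."""
        w2 = self.params[2]
        hidden, out = self.forward(x)
        n = len(x)
        # Sigmoid derivative with slope s is s * y * (1 - y).
        d_out = (out - target) * self.slope * out * (1.0 - out)
        d_hid = (d_out @ w2.T) * self.slope * hidden * (1.0 - hidden)
        grads = [x.T @ d_hid / n, d_hid.mean(axis=0),
                 hidden.T @ d_out / n, d_out.mean(axis=0)]
        for p, v, g in zip(self.params, self.velocity, grads):
            v *= momentum        # momentum term keeps part of the previous step
            v -= lr * g
            p += v               # in-place update of weights and biases
        return 0.5 * np.mean(np.sum((out - target) ** 2, axis=1))

# Synthetic stand-in data: 7 classes, 10 noisy 64-element vectors per class.
prototypes = rng.normal(0.0, 1.0, (7, 64))
labels = np.repeat(np.arange(7), 10)
features = prototypes[labels] + 0.1 * rng.normal(0.0, 1.0, (70, 64))
targets = np.eye(7)[labels]

net = FeedforwardNet()
losses = [net.train_step(features, targets) for _ in range(300)]
print(losses[0], losses[-1])
```

Re-running the sketch with `n_hidden=60` gives a simple way to compare learning curves for different hidden layer sizes.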

An example of the learning process is shown in Fig. 5. Increasing the number of hidden layer neurons above 40 gives only a slight, practically invisible improvement of the learning process. The preliminary simulations are promising, but further research is necessary, especially for a higher output resolution.


Table 1. Fundamental frequency parameters (Hz) for the same person

Emotion   F0 mean   F0 max   F0 min   F0 range
Anger     202       231      105      126
Joy       208       240      164      76


The extracted fundamental frequency parameters for a female voice are shown in Table 1. The parameters are clearly sensitive to the emotional state of the speaker: the range of F0 is 126 Hz for angry speech, while for joyful speech the corresponding range decreases to 76 Hz.


IV. CONCLUSION 


The frequency parameters of glottal waves have been extracted using a rather simple vocal tract model. Autocorrelation and cepstrum methods are also helpful in such extraction. The results are important not only for speaker identification and emotion recognition, but can also be helpful for the diagnosis of glottis malfunction.

The results of the speech processing system are satisfactory, but mistakes in the recognition process are sometimes observed. However, some of these tasks, especially emotion recognition, are difficult for human evaluators as well. On the other hand, the quality of the program depends on the training process. Unfortunately, it is difficult to obtain a proper base of voice examples for different emotions. Nevertheless, our program can recognize two states, positive and negative emotions, with almost 100% precision. Moreover, the proposed algorithms can be applied not only to emotion detection but can also be helpful in the medical diagnosis of speech processes. The sensitivity of the program to emotions such as anger or fear is measurable, but the feature vectors may be modified in the future. Moreover, a proper calculation of the distance between the vectors of the examined person and the database is a very important task; therefore we are trying to apply neural networks to solve this problem.

On the other hand, implementing the proposed algorithms in a hardware-software system, including a mixed analog-digital approach, should improve the speed and the quality of recognition [5]. Applying a mixed digital-analog realization to the design of sound processors may be better than a purely digital solution, and very often gives better results, decreasing the chip area and increasing the speed of the system.

The decision process can be improved by using more sophisticated classifiers, such as a fuzzy system for effective comparison of the current speaker's features with a stored database. The simplification of the recognition process in computer systems gives rise to a lot of redundancy, so a fuzzy logic approach combined with a neural system seems to be a promising solution. Such a fuzzy-neural system adds two important features: a learning function and adaptability. The system can be realized using a multi-layer feedforward ANN.

A purely software realization of the system involves a fundamental contradiction: parallel calculations, e.g. neural networks, are executed by a serial machine (the computer). Unfortunately, a fully hardware realization of a big neural network is impossible, because the physical connections between the neural units present a big technological problem. A parallel-serial approach gives a better solution. A mixed digital-analog realization can increase the efficiency of such a solution.




ACKNOWLEDGMENT 


This research is sponsored by a grant of the Polish Ministry of Education and Science, No. 3T11B 027 29.


REFERENCES 


[1] J. van Santen et al. (eds.): Progress in Speech Synthesis, Springer, New York, 1996.

[2] P. Gray, M.P. Hollier, R.E. Massara: "Non-intrusive speech-quality assessment using vocal-tract models", IEE Proc.-Vis. Image Signal Processing, vol. 147, no. 6, 2006, pp. 493-501.

[3] Z. Ciota: “Emotion Recognition on the Basis of Human Speech”, ICECom-2005, 18th International Conference on Applied Electromagnetics and Communications, 12-14 October 2005, Dubrovnik, Croatia, pp. 467-470.

[4] Chul Min Lee, Shrikanth S. Narayanan: "Toward Detecting Emotions in Spoken Dialogs", IEEE Trans. Speech and Audio Processing, vol. 13, no. 2, March 2005, pp. 293-303.

[5] W.J.J. Roberts, Yariv Ephraim: "Speaker Classification Using Composite Hypothesis Testing and List Decoding", IEEE Trans. Speech and Audio Processing, vol. 13, no. 2, March 2005, pp. 211-219.


A FLUID-STRUCTURE INTERACTION MODEL OF VOCAL FOLD

Abstract: Since fluid-structure interaction within the finite-element method is state of the art in many engineering fields, this method is applied here to voice analysis. A quasi two-dimensional model of the vocal folds including the ventricular folds is presented. First results of self-sustained vocal fold oscillation are presented, and possibilities as well as limitations are discussed.
Keywords: Fluid structure interaction, finite element method, vocal fold oscillation


I. INTRODUCTION 


Fluid-structure interaction effects are of great importance in models of vocal fold oscillation. This effect has been described by approaches with few degrees of freedom (multiple-mass models similar to [1]). Another possibility is a finite-element (FE) approach. In flow analyses with moving boundaries the structural part (vocal fold tissues) is normally modeled separately from the fluid part (flowing air). The two domains have to be coupled so that they influence each other. For this purpose the Arbitrary Lagrangian Eulerian (ALE) method is used, which combines the Lagrangian structural model with the Eulerian fluid model. An increasing number of such models are currently being developed [2],[3],[4].


Since these effects are also a topic in other fields of research such as mechanical or civil engineering, commercial codes have been designed that provide powerful methods for simulating the structural part, the fluid part, and the interaction between them. This study documents first results, limitations and possibilities of a fluid-structure interaction model of the vocal folds which is calculated by commercial solvers.


II. METHODS 


The complete vocal fold model consists of two coupled domains: a fluid domain representing the air, and a structural domain representing the vocal fold tissue. In principle, each of the domains is a stand-alone simulation model. The structural part is solved by ANSYS, the fluid part by CFX. The simulations have been performed in transient mode with a duration of 80 ms. Time steps have been adapted in a range between 0.1 ms and 2 ms in order to keep the calculation stable and efficient.


The model is a three-dimensional slice (thickness: 0.5 mm) of which only two dimensions are of interest. The considered section of the air domain is located in the frontal plane. It has a length of 30 mm and a width of 12 mm at its upper and lower end. The cranial-caudal thickness of the vocal fold is 7 mm. Concerning geometry, the local z-axis is a symmetry axis at x = 0.


Nevertheless, no half model with a symmetry boundary condition has been applied at this axis, so that asymmetrical flow effects can be simulated. The air is modeled as a transient, viscous, and laminar flow governed by the standard Navier-Stokes equations. At the lower boundary a relative pressure of 800 Pa is set. At the upper boundary, the relative pressure is zero. The lateral sides have wall boundaries. A “moving wall”, which changes the fluid mesh, is defined as the boundary condition for each of the vocal folds. This moving wall is the coupling interface of the models. A multigrid solver and a general relaxation parameter of 0.3 are chosen in order to obtain better convergence with moving boundaries.




Fig. 1: Left: fluid mesh. Right: Structural mesh with contact 
lines. 


For the structural analysis, linear volume elements are used. Since the displacements are large, geometric non-linearity is taken into account. The mesh of the structural part is derived from the same geometric form as the fluid part. To avoid interpenetration of the vocal folds, a contact area was set up between them. Numerical problems of the fluid solver during a complete closure of the vocal folds are avoided by a small gap between the folds which cannot be closed. A sketch of the models can be found in Fig. 1. The boundary conditions are bearings at the lateral ends of the vocal folds. To simulate the stiffness caused by the prestressed tension in the vocalis muscle and ligament, orthotropic material properties have been defined. The basis of the domain coupling is the ALE formulation. The domains are coupled sequentially. The transfer of the coupling information takes place along a common line of the vocal fold (structure) and the airway (fluid), which can be seen as a black line in Fig. 1. The coupled calculation consists of several calculation loops. First, the fluid solution is obtained and the resulting pressures are transferred as loads onto the structural model, which is solved afterwards. The obtained displacements are then transferred back onto the fluid mesh, and the loop is repeated until convergence is reached.
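The sequential coupling loop can be illustrated with a deliberately tiny stand-in. The sketch below is not the ANSYS/CFX setup: both "solvers" are hypothetical one-line functions (a wall load that falls as the gap opens, and a linear spring), and the wall area, stiffness and rest gap are invented. Only the 800 Pa driving pressure comes from the text; the factor 0.3 is reused here as an under-relaxation of the transferred displacement, which is an assumption about how such a parameter could be applied.

```python
P_SUB = 800.0      # Pa, driving pressure quoted in the text
AREA = 1.0e-4      # m^2, hypothetical loaded wall area
STIFFNESS = 80.0   # N/m, hypothetical lumped tissue stiffness
GAP0 = 1.0e-3      # m, hypothetical rest gap between the folds

def fluid_solve(displacement):
    """Toy fluid solver: the load on the wall falls as the gap opens."""
    return P_SUB * AREA / (1.0 + displacement / GAP0)

def structure_solve(load):
    """Toy structural solver: static displacement of a linear spring."""
    return load / STIFFNESS

def coupled_step(x0=0.0, relaxation=0.3, tol=1e-12, max_loops=500):
    """Repeat fluid -> structure -> mesh update until the displacement settles."""
    x = x0
    for loop in range(1, max_loops + 1):
        load = fluid_solve(x)           # pressures become structural loads
        x_new = structure_solve(load)   # structure answers with a displacement
        change = x_new - x
        x += relaxation * change        # displacement moves the fluid boundary
        if abs(change) < tol:
            return x, loop
    raise RuntimeError("coupling loop did not converge")

x_eq, n_loops = coupled_step()
print(f"equilibrium displacement {x_eq * 1e3:.4f} mm after {n_loops} loops")
```

In a transient simulation a loop of this kind would run inside every time step, with the converged displacement defining the fluid mesh for the next step.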


III. RESULTS


The Young's modulus of the structure was set to 7.0 N/mm² in the lateral direction and to 20.0 N/mm² in the other directions. The Poisson's ratios were set to 0.4 and the shear moduli to 5.0 N/mm². The density was taken as 1040 kg/m³. These properties result in structural eigenfrequencies of 114 Hz (first), 188 Hz (second), and 247 Hz (third) (see Fig. 2). After approximately 20 ms a relatively stable oscillation was achieved. The oscillation pattern was a combination of the first and the second eigenforms, in which the second eigenform was clearly dominant.


Fig. 2: Left: Extrema of the eigenforms of the first (114 Hz, top), second (188 Hz, middle), and third eigenfrequency (247 Hz, bottom). Right: One oscillation period (t = 5 ms with Δt = 0.5 ms from number 1 to 10).


The obtained velocity profile (at the mid point between the vocal folds at the caudal end) had an offset of 5-10 m/s due to the modeled gap between the folds. The spectrum of these velocities shows two peaks (176 Hz and 337 Hz, see Fig. 3). Concerning the flow in the supraglottal area, a jet is formed which tends to orient towards one lateral side (Coanda effect).


Fig. 3: Velocities of the airflow at the caudal end of the vocal folds (top) and their spectrum (bottom).


IV. DISCUSSION 


The feasibility of a fluid-structure interaction model with commercial finite-element codes was shown. The obtained results have to be regarded as preliminary. The oscillation pattern showed a predominant role of the second eigenform, while other models suggest a dominant first eigenform [5],[6]. To get stable results in the fluid solver, a small channel has to be left open, so a complete closure of the vocal folds is impossible. The influence of this constriction will have to be examined in more detail. In future studies, more results will be calculated and compared to literature data. The influence of the vocal fold shape will be another point of interest.

Abstract: This work presents a procedure for the estimation of a two-mass vocal fold model starting from a time-varying target flow signal. The model is specified by a large number of physical parameters, computed as functions of four articulatory parameters (three laryngeal muscle activations and subglottal pressure). Flow waveforms synthesized by the model are characterized by means of a set of typical voice source quantification acoustic parameters. Given a sequence of target acoustic parameters, dynamic programming techniques and interpolation based on Radial Basis Function Networks are used to derive sequences of articulatory parameters that lead to resynthesis of the target signal.

Keywords: Voice source, Low-dimensional models, Estimation, 
Synthesis 


I. INTRODUCTION 


One open problem in research on low-dimensional vocal 
fold physical models is the relationship between parame- 
ters of the models and acoustic parameters related to voice 
quality. A recent work [1] studied the sensitivity of acoustic
flow parameters to variation of physical parameters in a
two-mass model, and provided indications of the “actions” 
that the model employs to target different voice qualities. 
However low-level parameters (masses, spring stiffnesses, 
etc.) are not independently controlled by a speaker: more 
physiologically motivated control spaces are needed. A 
related issue is the “inverse problem”, i.e. the problem of 
estimating the time-varying control parameters to be used 
as input to the physical model in order to resynthesize a 
target acoustic signal. This involves inversion of a non- 
linear dynamical system with a large number of parame- 
ters. Moreover the solution is in principle non-unique. A 
possible solution to the non-uniqueness problem is working 
on temporal sequences of acoustic frames and estimating 
articulatory parameters through minimization of some cost 
function that includes an “articulatory effort” component. 
This approach has been applied in [2] to the solution of 
the inverse problem for an articulatory vocal tract model. 

This paper presents a procedure for the estimation of a two-mass vocal fold model [3] starting from time-varying acoustic parameters of a target flow signal. The model is specified by a large number of low-level physical parameters. An additional modeling layer computes these physical parameters as functions of four articulatory parameters (three activation levels of laryngeal muscles and subglottal pressure) [4]. Glottal flow waveforms synthesized by the model are characterized by means of a set of acoustic parameters: fundamental frequency F0, open quotient OQ, speed quotient SQ, return quotient RQ, normalized amplitude quotient NAQ [5], etc., that are used in the literature as typical voice source quantification parameters [6].

Therefore there are three related but distinct spaces of parameters: articulatory, physical, and acoustic parameters. This work deals with the problem of mapping acoustic into articulatory parameters. We tackle the problem by characterizing temporal frames of glottal flow signals via sequences of acoustic parameters, and by developing a methodology to derive the corresponding sequences of articulatory parameters using dynamic programming techniques. The procedure is further improved by using Radial Basis Function Networks (RBFN) to interpolate points in the articulatory space. Results show that the physical model controlled via the estimated parameters is able to resynthesize the target flow signal with good accuracy.

Section II describes the physical model used in this work, while Sec. III details the techniques used to estimate the model starting from a target time-varying flow signal. Results, as well as current limitations and shortcomings of the proposed approach, are discussed in Sec. IV.


II. THE PHYSICAL MODEL 


The analysis developed in the next sections is based on a two-mass model presented in [3] and depicted in Fig. 1. The model assumes in particular one-dimensional, quasi-stationary, frictionless and incompressible flow from the subglottal region up to a time-varying separation point $z_s$ along the glottis, where flow separation and free jet formation occur. No pressure recovery is assumed at the glottal exit. The separation point $z_s$ is predicted in [3] to occur when the glottal area $a(z)$ exceeds the minimum area by a given amount (10-20%). By introducing a separation constant $s$ (in the range 1.1-1.2), separation occurs when the glottal area takes the value $a_s = \min(s a_1, a_2)$.

Fig. 1. Right: schematic diagram of the vocal fold, trachea, and supraglottal vocal tract; left: two-mass vocal fold model.

The vocal tract is modeled as an inertive load. In the limit of fundamental frequencies much lower than the first formant frequency the air column acts approximately as a mass that is accelerated as a unit, and the vocal tract input pressure can be written as $p_v(t) = R u(t) + I \dot{u}(t)$, where $R$, $I$ are the input resistance and inertance, respectively. Values for $R$, $I$ are chosen from [7]. Being a first-order system, this model does not account for resonances of the vocal tract; however, it describes with sufficient accuracy its most relevant effects on vocal fold oscillation, in particular the lowering of the oscillation threshold pressure [7].
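For a sampled flow signal, the first-order load (input pressure equal to resistance times flow plus inertance times the flow derivative) can be evaluated directly. The sketch below assumes NumPy and a finite-difference derivative; the resistance and inertance values and the ramp-shaped flow are hypothetical placeholders, not the values taken from [7].

```python
import numpy as np

def tract_input_pressure(flow, dt, resistance, inertance):
    """Input pressure of a first-order (resistive plus inertive) vocal tract load."""
    return resistance * flow + inertance * np.gradient(flow, dt)

dt = 1.0e-4                          # sampling step in s (assumed)
t = dt * np.arange(100)
flow = 2.0e-4 * t / t[-1]            # hypothetical ramp up to 200 cm^3/s, in m^3/s
R_HYP, I_HYP = 1.0e5, 500.0          # hypothetical resistance and inertance

p_v = tract_input_pressure(flow, dt, R_HYP, I_HYP)
print(p_v[0], p_v[-1])
```

For a linearly rising flow the inertive term is constant, so the load adds a fixed pressure offset on top of the resistive drop.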
Low-level physical parameters (masses, spring stiffnesses, etc.) are not independently controlled by a speaker: more physiologically motivated control spaces are needed, which requires establishing a mapping between physiology (muscle activations) and physics (parameters of the two-mass model). A set of empirical rules, derived from [8], was used in [4] for controlling a two-mass physical model. The rules link vocal fold geometry to activation levels of three muscles: cricothyroid ($a_{CT}$), thyroarytenoid ($a_{TA}$) and lateral cricoarytenoid ($a_{LC}$). These levels are assumed to be normalized in the [0,1] range. In addition, in this paper we also consider the subglottal pressure $p_s$. In conclusion, the physical model is completely controlled by the set of four articulatory parameters $a_{CT}$, $a_{TA}$, $a_{LC}$, $p_s$.


III. MODEL ESTIMATION 
A. An articulatory codebook 


The first step of the estimation procedure is to define and populate a direct codebook, in which every vector of articulatory parameters $a_{CT}$, $a_{TA}$, $a_{LC}$, $p_s$ is a “key” and is associated with one and only one vector of acoustic parameters. To this aim, a large number of numerical simulations of the two-mass model is run on a dense grid of vectors of articulatory parameters. For each simulation, relevant acoustic parameters are extracted from the synthesized glottal flow signal using the APARAT toolkit [9].

Fig. 2. Distribution of acoustic parameters in the direct codebook.

The direct codebook used in this work has been derived on a grid where $a_{CT}$ and $a_{TA}$ vary in the range 0 to 1 with a fixed step of 0.05, while the range for $a_{LC}$ is 0.25 to 0.5 with a fixed step of 0.025 (because sustained phonation only occurs within this region), and $p_s$ varies in the range 500 to 1500 Pa with a fixed step of 50 Pa. The resulting codebook contains 86125 vector pairs. Fig. 2 shows the distribution of the 7 computed acoustic parameters in the direct codebook.
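The articulatory grid described above can be enumerated as in the following sketch (NumPy and `itertools` assumed; variable names are illustrative). The full Cartesian grid has 21 x 21 x 11 x 21 = 101871 points, which is more than the 86125 pairs reported, so presumably not every grid point was retained in the codebook; the text does not state the selection criterion, and the sketch does not attempt to reproduce it.

```python
import itertools
import numpy as np

# Ranges and steps as given in the text; linspace avoids floating-point step drift.
a_ct = np.linspace(0.0, 1.0, 21)       # step 0.05
a_ta = np.linspace(0.0, 1.0, 21)       # step 0.05
a_lc = np.linspace(0.25, 0.5, 11)      # step 0.025
p_s = np.linspace(500.0, 1500.0, 21)   # step 50 Pa

grid = np.array(list(itertools.product(a_ct, a_ta, a_lc, p_s)))
print(grid.shape)

# Each row is one articulatory "key"; a simulation of the two-mass model followed
# by acoustic parameter extraction would supply the associated acoustic vector.
first_key = grid[0]
```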


B. Codebook inversion and dynamic codebook access 


In order to solve the inverse problem, the direct codebook has to be inverted to obtain the inverse codebook. This however suffers from a non-uniqueness problem, i.e. an acoustic vector can be the key to one or more articulatory vectors. We tackle the problem by working on temporal sequences of acoustic vectors, rather than on a single vector. These may be obtained e.g. by analyzing a time-varying glottal flow signal on a frame-by-frame basis. Given a sequence of acoustic vectors $x_k$ we want to obtain an “optimal” sequence of articulatory vectors $v_k$ in the inverse codebook: as already explained, $x_k$ is in principle associated with many candidate vectors $v_k^i$, because of the non-uniqueness problem. In particular we perform a search in the acoustic space of the inverse codebook to find the nearest vectors (according to the euclidean distance) to the given $x_k$; the $v_k^i$ are therefore the articulatory vectors associated to these nearest vectors in the codebook.

The optimal sequence of articulatory parameters is obtained by minimizing a cost function with three terms. An acoustic term accounts for the euclidean distance between $x_k$ and its discretized versions in the acoustic space of the codebook (the vectors found by the search). An articulatory term minimizes the euclidean distance between $v_k^i$ and $v_{k-1}^j$, i.e. between every two articulatory vectors consecutive in time. This is the key term in the procedure, in order to obtain smooth parameter variations: it minimizes the “articulatory effort”, in accordance with the physiological muscle behavior. An accumulation term extends the cost function domain to the entire input sequence, so that the obtained articulatory sequence is optimal in a global way. The (simplified) cost function is:

$$F(v_k^i) = \min_j \left[ r_1 \|x_k - c_k^i\|^2 + r_2 \|v_k^i - v_{k-1}^j\|^2 + F(v_{k-1}^j) \right]$$

where $r_{1,2}$ are weights for the acoustic and articulatory terms, respectively; $c_k^i$ are the discretized acoustic vectors close to $x_k$. Dynamic programming techniques are the ideal tool for the minimization of the cost function: in particular the accumulation term would lead to exponential complexity, if not computed with this approach.
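A cost of this shape (a per-frame acoustic term, a frame-to-frame articulatory term, and an accumulated term) can be minimized with a standard Viterbi-style recursion. The following is a minimal sketch under stated assumptions, not the authors' code: NumPy, the same number of candidates in every frame, squared Euclidean distances, and a three-frame toy example with invented numbers.

```python
import numpy as np

def dp_path(x, cand_acoustic, cand_artic, w_acoustic=1.0, w_artic=1.0):
    """Pick one candidate per frame minimizing the accumulated cost.

    x:             (K, Da) target acoustic vectors
    cand_acoustic: (K, N, Da) acoustic vectors of the N candidates per frame
    cand_artic:    (K, N, Dv) articulatory vectors of the same candidates
    """
    n_frames, n_cand = cand_acoustic.shape[:2]
    acoustic = w_acoustic * np.sum((cand_acoustic - x[:, None, :]) ** 2, axis=2)
    cost = acoustic[0].copy()
    back = np.zeros((n_frames, n_cand), dtype=int)
    for k in range(1, n_frames):
        # trans[j, i]: effort of moving from candidate j (frame k-1) to i (frame k)
        diff = cand_artic[k][None, :, :] - cand_artic[k - 1][:, None, :]
        total = cost[:, None] + w_artic * np.sum(diff ** 2, axis=2)
        back[k] = np.argmin(total, axis=0)
        cost = acoustic[k] + total[back[k], np.arange(n_cand)]
    path = [int(np.argmin(cost))]
    for k in range(n_frames - 1, 0, -1):     # backtrack the best predecessors
        path.append(int(back[k][path[-1]]))
    return path[::-1]

# Toy example (invented): two candidates per frame, one-dimensional spaces.
x = np.array([[1.0], [1.0], [1.0]])
cand_acoustic = np.array([[[1.0], [1.0]], [[1.0], [1.0]], [[1.0], [2.0]]])
cand_artic = np.array([[[0.2], [0.8]], [[0.25], [0.9]], [[0.85], [0.3]]])
path = dp_path(x, cand_acoustic, cand_artic)
print(path)
```

In the first two frames both candidates match the target equally well, so the choice is made by the articulatory term: the path that ends on the acoustically exact candidate of the last frame with the least total movement wins. The recursion costs on the order of K N^2 operations instead of enumerating all N^K sequences.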


C. Codebook clustering and interpolation with RBFNs 


One problem in the proposed procedure is that a target vector $x_k$ is typically not present in the inverse codebook, which is discrete; therefore every found $v_k^i$ is not associated with $x_k$, but only with a vector near to $x_k$. The limitations of the discrete codebook can be overcome by interpolating the articulatory space; this makes it possible to compute articulatory vectors associated exactly to the given $x_k$.

The interpolation uses RBFNs (Radial Basis Function 
Networks) [10]. Since RBFNs only interpolate functions 
and cannot handle multimaps, the inverse codebook has to 
be manipulated and the non-uniqueness problem avoided. 
We have developed a novel algorithm that subdivides the 
codebook in acoustic clusters and articulatory subclusters. 
Every cluster is associated to one or more subclusters. The 
algorithm guarantees that for every acoustic vector in a 
given cluster there will be only one (or none) articulatory 
vector in each associated subcluster. As a result in every 
subcluster the subdivided codebook provides a unique 
mapping, which is needed for RBFNs to work properly. 

The algorithm first subdivides the acoustic space in clusters $C_i$ using a standard technique. Random vectors, as many as the desired clusters, are generated and subsequently moved with an iterative procedure [11] to become centroids. Centroids are iteratively displaced in such a way that the sum of the distances between every centroid and the associated vectors is minimized. Clusters $C_i$ are built by associating every acoustic vector with the nearest centroid. In order to obtain a uniform distribution of vectors in every cluster, the iterative procedure is applied in a two-stage fashion. Moreover, in order to ensure a certain degree of overlapping, the vectors which are closest to boundaries between two clusters are replicated in both.

Once the acoustic clusters $C_i$ are built, the algorithm determines the $s$ articulatory subclusters $S_i^1, \ldots, S_i^s$ associated to each $C_i$. Here $s$ equals the maximum number of articulatory vectors associated to the same acoustic vector $x^*$ in $C_i$. Every articulatory vector associated with $x^*$ is assigned to a distinct subcluster and used as a “seed”. The remaining articulatory vectors are allocated as follows. When many articulatory vectors $v_k^i$ are associated to the same acoustic vector $x_k$, every $v_k^i$ is assigned to a different subcluster, chosen as the one with the nearest articulatory centroid. The location of the subcluster centroid is updated after every new vector is added.

Having determined the clusters C_i, each associated with
one or more subclusters S_i^j, within every S_i^j we construct
four different RBFNs to interpolate each dimension of
the articulatory space. Every acoustic vector associated to
the subcluster is used as the center of one RBF (Gaussian
functions in our application). Values for the parameters of
the functions (standard deviation, etc.) were found after an
extensive set of experiments on the codebook. After the
determination of all the RBFNs, the articulatory space can
be interpolated. The following procedure is used to feed the
dynamic programming with interpolated vectors. Given an
acoustic vector, we find the k nearest acoustic clusters and
all the associated subclusters. The acoustic vector is used
as input for the set of RBFNs in each subcluster. Finally,
all the computed interpolated articulatory vectors (as many
as the subclusters) are passed to the dynamic programming
procedure, which proceeds with the optimization.
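
A minimal sketch of one such network is given here for illustration. It assumes exact interpolation (one Gaussian per center, weights obtained by solving the resulting linear system) and a single shared standard deviation `sigma`. Neither the training rule nor the parameter values are stated in the text, so both are assumptions, as are all function names.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def gaussian(u, v, sigma):
    """Gaussian radial basis function of the distance between u and v."""
    d2 = sum((p - q) ** 2 for p, q in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def fit_rbfn(centers, targets, sigma):
    """Weights of an RBFN that reproduces `targets` at the `centers`.

    centers : acoustic vectors of one subcluster (assumed distinct)
    targets : values of ONE articulatory dimension at those vectors
    """
    phi = [[gaussian(c, d, sigma) for d in centers] for c in centers]
    return solve(phi, targets)


def eval_rbfn(x, centers, weights, sigma):
    """Interpolated articulatory value for the acoustic vector x."""
    return sum(w * gaussian(x, c, sigma) for w, c in zip(weights, centers))
```

Under these assumptions, the four networks of a subcluster would share the same centers and differ only in their target values and weights.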


IV. RESULTS AND DISCUSSION 


The proposed algorithms were initially tested and tuned
using artificial target sequences of acoustic vectors. These
were used as input to the system to obtain the corresponding
articulatory parameters. Results from these preliminary
tests provided two main indications. First, the synthetic
signals obtained by driving the physical model with the
derived articulatory parameters follow closely the target
acoustic vectors. Second, the derived muscular activations
and subglottal pressure have physiologically plausible
evolutions, i.e. they have smooth variations in time. These
initial results confirm the validity of the employed cost
function, and of the RBFN interpolation.

In order to test the proposed algorithms on real signals,
we have realized a complete synthesis-by-analysis
procedure. Starting from a recorded utterance (a sustained
vowel with varying pitch and voice quality), the signal is
inverse filtered with APARAT. The estimated glottal flow
is analyzed frame-by-frame and a sequence of acoustic
vectors is obtained. The corresponding articulatory vectors
(derived using the techniques described in Sec. III) are used
to drive the physical model, and the resynthesized glottal
flow is convolved with the time-varying formant filter of
the vocal tract. The final result is a resynthesis of the
utterance, in which the evolution of pitch and voice quality
is close to that of the original signal.

Fig. 3. Example of the analysis-by-synthesis procedure. (a) Time sequences of articulatory parameters retrieved by the optimization procedure (solid
line: no RBFNs; dashed line: RBFNs). (b) Time sequences of glottal flow acoustic parameters (dotted line: target sequences extracted from a recorded
utterance; solid line: resynthesis without RBFNs; dashed line: resynthesis with RBFNs).


Fig. 3 shows the performance of the synthesis-by-analysis
procedure on a real utterance (a sustained /e/). The
time-varying acoustic vectors obtained in the resynthesis
follow the target ones with good accuracy, and informal
listening tests confirm that the resynthesis is qualitatively
similar to the target signal. In particular the NAQ is usually
well followed, as shown in Fig. 3(b). This is a positive
result as the NAQ is known to be strongly related to voice
quality [5]. The effect of using RBFNs can be noticed
in Fig. 3(a): the sequences of articulatory vectors
interpolated by RBFNs are smoother than those obtained using
bare dynamic programming. A second advantage of using
RBFNs is that the number of vectors fed to the dynamic
programming procedure is significantly reduced, and this
leads to a corresponding decrease in the computation time.


While the results reported in this work indicate that the
proposed approach is effective in estimating control
parameters of the physical model, both with synthetic target
data and with real utterances, a number of limitations
still hinder the performance of the estimation procedure.
These are mainly related to intrinsic
limitations of the two-mass model. Ranges of variation for
the acoustic parameters are generally narrow (see Fig. 2),
and are sometimes unrealistic. RQ and NAQ in particular
assume exceedingly low values, due to the poor description of
the flow at small glottal apertures, which results in abrupt
glottal closure and exceedingly high absolute values of the
flow derivative peak. The relationship between physical
parameters of the model and acoustic parameters also
needs to be assessed: as an example, the relation between
p_s and F0 observed in the model is not in accordance with
results reported in the literature. Finally, a more systematic
approach to the determination of RBFN parameters is
needed in order to fully exploit the benefits of interpolation
in the codebook.

LARYNGEAL VOICE QUALITY CHANGES IN EXPRESSION OF

Abstract: In this study three different prominence and
speech melody related effects on voice quality were 
studied using an inverse filtering based method. The 
hypothesis that prominence as a function of sentence 
and word stress is signaled with more pressed voice 
quality was tested. The results indicated discrepancies 
in the parameterization results, and the original hy- 
pothesis could not be confirmed. Instead, it is suggested 
that prominence is expressed with a more breathy voice 
quality. A possible physiological explanation for the 
phenomenon is also provided. 

Keywords: voice inverse filtering, prominence 


I. INTRODUCTION


Local increase of the voice fundamental frequency and 
attenuation of the respective changes elsewhere is used 
to signal prominence relations within utterances, phrases, 
words, or series of utterances. Fundamental frequency 
variation, however, is not the sole effect in the expression 
of prominence. Several studies have reported spectral or 
glottalization changes as an effect of prominence, suggest- 
ing that the changes are due to a tenser voice quality in 
prominent vowels [6]. 

Although Laver’s original definition of voice quality 
[10] attributed it to be caused by both laryngeal and 
supralaryngeal features of the voice production mecha- 
nism, it is nowadays often restricted to only reflect the la- 
ryngeal settings of speech. The major physiological source 
of these changes, in turn, is represented by the airflow gen- 
erated by the vibrating vocal folds, the glottal flow. Un- 
fortunately, direct measurement of this major source of 
voice quality is not possible from continuous speech due 
to the hidden position of vocal folds, located deep within 
the larynx and surrounded by several vital organs. Hence, 
the only feasible means to estimate the glottal flow from 
speech is to use a technique called inverse filtering. This 
implies that resonances of the vocal tract are cancelled 
from the speech pressure signal by feeding it through anti- 
resonances which have been defined from the underlying 
speech spectrum. The glottal flow is then parameterized 
using some time, amplitude, frequency, or model-based 
techniques to gather numerical data of the studied phenom- 
ena. 

Research on the function of the glottal flow has con- 
centrated mainly on isolated vowels. In contrast to this, 
there is surprisingly little evidence on how the glottal flow 
as the source of voice quality behaves in the sentence and 
word level in expression of stress. One such study was per- 
formed by Gobl, who studied LF model parameter fitting 
in a sentence, the focus of which varied [7]. The excitation
parameter E_e values were found to be larger in focal
context, indicating voice quality changes in the expression
of focus. Swerts and Veldhuis reported some evidence for
correlation of F0 and voice quality expressed by the first
harmonic amplitude difference (ΔH12, or H1-H2) in the
context of speech melody [13]. However, they also cited
several other studies regarding F0 and the open quotient
(OQ), in which contradictory views were presented, i.e. F0
and OQ did not correlate, or exhibited negative correlation.
Epstein stated that speakers use voice quality, as expressed
by LF model parameter changes, to distinguish between
prominent and non-prominent words in declarative and
interrogative sentences [6].

The understanding of the behaviour of voice quality in
expression of stress is limited to a large extent by the lack
of relevant methodologies to analyze the glottal flow from
continuous speech. In order to address this issue, the
current study utilizes TKK Aparat, a unified voice inverse
filtering and parameterization package. Using this
speech research tool, the authors further tested the
hypothesis that stress is expressed with a pressed voice
quality. Hence, this study extends previous work
on the topic, which has relied on only a handful of speakers
with restricted utterances or model-based parameters. This
is done using continuous speech and robust voice
source parameters together with statistical analyses.


II. MATERIALS AND METHODS 


Speech of healthy, native Finnish speakers was recorded. 
There were 11 speakers in total, of which 6 were women. 
The ages of the subjects ranged from 18 to 48 years, mean 
being 30 years. Two of the speakers smoked regularly or 
irregularly, while the others were non-smokers. 

The recordings were performed in an anechoic chamber. 
The speakers were standing, reciting the text from a paper 
attached on a sheet of cardboard. 

The speakers were equipped with a headset microphone
consisting of a unidirectional Sennheiser electret capsule.
The microphone signal was routed through a microphone
preamplifier and a mixer to an iRiver iHP-140 digital audio
recorder. Low-frequency phase distortion introduced by
the digital recorder was corrected by acquiring the input
impulse response of the device using an MLS measurement
[12] and convolving the recorded signals with a
time-reversed version of the impulse response.
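
The phase correction can be sketched as follows. Convolving a signal that has passed through a system with impulse response h with the time-reversed h makes the overall response equal to the autocorrelation of h, which is symmetric and therefore has linear phase (the magnitude response is applied twice as a side effect). The function names are illustrative and the impulse response in the example is made up, not the measured one.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def phase_correct(recorded, impulse_response):
    """Convolve the recording with the time-reversed impulse response."""
    return convolve(recorded, impulse_response[::-1])


# Toy check: a unit impulse "recorded" through h comes out as h itself,
# and the corrected overall response is symmetric about its centre.
h = [1.0, 0.5, 0.25]
overall = phase_correct(convolve([1.0], h), h)
```

The corrected signal is delayed by len(h) - 1 samples, which has to be accounted for when time instants are compared with the original recording.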

The speech material consisted of three passages of
Finnish text describing past weather conditions. The
material was selected so that there were multiple [a] vowels
with different levels of prominence suitable for inverse
filtering. The three different speech melody (and hence
prominence) related conditions using a long [a] or [æ]
segment were chosen as follows: (1) A paragraph initial
content word with a relatively high F0 in a lexically stressed
syllable (sentence stress condition). (2) The same segment
in a later repetition of the word (word stress condition). (3)
A long [a] in a lexically unstressed position. Each
recitation took about one minute, and was repeated three times.
The middle recitation was chosen for further processing.
Three vowels from each passage, a total of nine, were then
marked using Praat [4]. In total there were 3 × 3 × 11 = 99
marked vowels.

The phase-corrected recordings were high-pass filtered 
to remove any low-frequency noise in the signal and then 
cut into separate files containing only single vowels us- 
ing the time instants marked in Praat. Further processing 
of the segmented files was performed using TKK Aparat, 
which is a comprehensive voice inverse filtering and pa- 
rameterization software package, supporting two different 
inverse filtering methods, a multitude of time, amplitude, 
frequency, and model-based glottal flow parameters, seam- 
less interoperation with the MATLAB environment and 
easy exporting of data to statistical software packages [1]. 

The separated vowel files were inverse filtered using 
the iterative adaptive inverse filtering (IAIF) algorithm [3]. 
The flow diagram of the current version of IAIF, which 
is a slightly modified version from the previous ones, is 
shown in Fig. 1. Most notably, parametric spectral models 
used in various blocks of the flow diagram are computed 
with the discrete all-pole modeling (DAP) method [5] in- 
stead of the conventional linear predictive analysis. This 
reduces the bias of the harmonic structure of the speech 
spectrum in the formant frequency estimates. In block no. 
1 of Fig. 1, the speech signal is high-pass filtered using 
a linear-phase FIR filter to reduce any low frequency fluc- 
tuations captured during the recordings. Stages 2-6 form 
the first glottal flow approximation by making an estimate 
of the vocal tract transfer function and inverse filtering the 
signal with that estimate. The first approximation is used 
as a basis for stages 7-12, which roughly repeat the process 
of the earlier stages to yield the final glottal flow estimate. 
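
The core operation inside each stage, estimating an all-pole model and filtering the signal with its inverse, can be sketched as below. This is deliberately not IAIF and not DAP: it is a single pass of conventional autocorrelation-method linear prediction (Levinson-Durbin recursion), shown only to make the idea of cancelling resonances with anti-resonances concrete. All names are illustrative.

```python
def autocorr(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]


def lpc(x, order):
    """Coefficients of A(z) = 1 + a1 z^-1 + ... + ap z^-p (Levinson-Durbin)."""
    r = autocorr(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # remaining prediction error
    return a


def inverse_filter(x, a):
    """Apply the FIR filter A(z) to x, cancelling the modelled resonances."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


# Toy check: the impulse response of 1 / (1 - 0.9 z^-1) is 0.9**n;
# inverse filtering with the estimated A(z) should recover the impulse.
signal = [0.9 ** n for n in range(200)]
coeffs = lpc(signal, order=1)
residual = inverse_filter(signal, coeffs)
```

In IAIF the same estimate-and-inverse-filter step is repeated with different model orders so that the glottal and vocal tract contributions can be separated; see [3] for the actual stages.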

The inverse filtering process yields glottal flow esti- 
mates, an example of which is shown in Fig. 2. The glottal 
flow parameters of the vowel segments were computed au- 
tomatically from the estimated glottal flow. Even though 
all parameters implemented in TKK Aparat were acquired,
further analysis was restricted to the NAQ and AQ
parameters. NAQ, the normalized amplitude quotient, and
AQ, the amplitude quotient, measure time-domain
characteristics of the glottal closing phase from two
amplitude-domain quantities [2]. AQ is defined as
$AQ = A_{ac} / d_{min}$, where $A_{ac}$ is the maximum AC
amplitude of the flow and $d_{min}$ is the minimum of the
flow derivative. Correspondingly, NAQ is defined as
$NAQ = AQ / T_0$, where $T_0$ is the period length. Both AQ
and NAQ correlate well with the pressedness of voice, which
contributes considerably to the voice quality.
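
As a small illustration of these definitions, the sketch below computes AQ and NAQ for one cycle of a sampled flow. The derivative is approximated by a first difference and the magnitude of its negative peak is used; both choices, as well as the function name and the synthetic triangular pulse, are assumptions made for the example.

```python
def amplitude_quotients(flow, fs, period):
    """Return (AQ, NAQ) for one glottal cycle.

    flow   : samples of the estimated glottal flow over one period
    fs     : sampling rate in Hz
    period : period length T0 in seconds
    """
    a_ac = max(flow) - min(flow)                 # AC amplitude of the flow
    derivative = [(b - a) * fs for a, b in zip(flow, flow[1:])]
    d_min = min(derivative)                      # negative peak of the derivative
    aq = a_ac / abs(d_min)                       # in seconds
    return aq, aq / period                       # NAQ is dimensionless


# Synthetic pulse at 10 kHz: 6 ms opening, 2 ms closing, 2 ms closed.
fs = 10000
flow = [n / 60 for n in range(60)] + [1 - n / 20 for n in range(20)] + [0.0] * 20
aq, naq = amplitude_quotients(flow, fs, period=0.01)
```

For this pulse the closing slope is 1 / 2 ms, so AQ comes out as about 2 ms and NAQ as about 0.2; a more abrupt closure lowers both values.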

Figure 1: The block diagram of the IAIF method for the
estimation of the glottal excitation g(n) from the speech
signal s_0(n). For further details of the different stages,
please refer to [3].

The effect of various factors on the NAQ and AQ values
was tested using analysis of variance (ANOVA). First, the
values were log-transformed to correct the skew in the
parameter distributions. Then, ANOVA was performed using the
vowel running number, speaker sex, sentence stress, and
word stress as independent variables and the log-transformed
parameter value as the dependent variable. All statistical
treatments were performed using the R statistical software
environment [9].
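
The analysis itself was run in R with several factors. As a simplified illustration only, the sketch below computes the F statistic of a single-factor ANOVA after a log transform; the function names and the example numbers are made up and do not come from the study.

```python
import math


def log_transform(values):
    """Natural-log transform used to reduce skew in positive-valued data."""
    return [math.log(v) for v in values]


def one_way_f(groups):
    """F statistic and degrees of freedom of a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Made-up NAQ-like values for an unstressed and a stressed group.
unstressed = log_transform([0.09, 0.10, 0.11, 0.08])
stressed = log_transform([0.11, 0.12, 0.10, 0.13])
f_value, df1, df2 = one_way_f([unstressed, stressed])
```

A multi-factor design such as the one described above partitions the between-group sum of squares further by factor, but each reported F value is still a ratio of mean squares of this kind.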


III. RESULTS


In general, inverse filtering analysis of continuous speech 
is problematic due to, for example, rapid changes in for- 
mant frequencies. In spite of these inherent difficulties, the 
analyses conducted in the current study were successful 
and reliable estimates of the glottal flow could be com- 
puted with the IAIF method for all the intended samples. 
Furthermore, during the inverse filtering process, a subjec- 
tive quality evaluation on a scale of 0-3 was given for each 
glottal flow estimate using the general shape of the result- 
ing glottal flow estimate as the criterion. This evaluation 
yielded a mean value of 2.4, which is considerably higher 
than in other inverse filtering studies conducted by the au- 
thors. 




Figure 2: A representative sample of the glottal flow, acquired from the material of the current study. The sample represents an
[a] vowel of a male speaker with no sentence or word stress. The measures required for the computation of NAQ and AQ are
also illustrated.


First, NAQ parameter values were inspected. Box-plots
summarizing the values are shown in the left half of Fig. 3.
The mean value of the NAQ parameters computed for both
genders was 0.108, while the standard deviation (given in
parentheses below) equaled 0.024. The respective values for
males and females were 0.102 (0.020) and 0.113 (0.025), i.e.
the values were smaller for males. For males, in vowels
without and with sentence stress, the values were 0.098
(0.020) and 0.109 (0.019). Two-way ANOVA with word and
sentence stress as factors indicated that this difference was
not statistically significant [F(1,42) = 3.65, p = 0.063]. For
male vowels without and with word stress, the NAQ values
were 0.091 (0.016) and 0.107 (0.019), again indicating
higher values for stressed cases. This result was
statistically significant [F(1,42) = 4.49, p = 0.040].


For females, the NAQ values were 0.109 (0.025) with- 
out sentence stress, and 0.120 (0.025) with it. The re- 
spective values without and with word stress were 0.105 
(0.020) and 0.117 (0.028). Again, the NAQ values were 
higher for stressed cases, but the results were not sta- 
tistically significant [F(1,51) = 2.72, p = 0.105 and 
F(1,51) = 0.532, p = 0.469, respectively]. 


Due to the NAQ values behaving contrary to the research
hypothesis (see Section IV for details), AQ parameter
values were also studied. The summary box-plots
are shown in the right half of Fig. 3. The mean
AQ values for males and females were 0.841 (0.134) and
0.567 (0.123), respectively, the considerably higher values
for males stemming mainly from the F0 differences
between males and females. For males, the values were
0.890 (0.129) without sentence stress and 0.745 (0.087)
with it. This difference was found statistically significant
[F(1,42) = 16.0, p < 0.001]. For males without and
with word stress, the values were 0.869 (0.126) and 0.828
(0.138), respectively. This indicated lower AQ values in
stressed cases for males. However, the result was not
statistically significant [F(1,42) = 0.858, p = 0.360]. For
females, the values were 0.605 (0.114) and 0.489 (0.105)
without and with sentence stress, and 0.609 (0.103) and
0.545 (0.128) without and with word stress, respectively.
Hence, the values were smaller in the stressed cases for
females as well. In the case of sentence stress, the result was
statistically significant [F(1,51) = 14.0, p < 0.001], but
not so in word stress [F(1,51) = 0.0762, p = 0.784].


IV. CONCLUSIONS


The NAQ values were higher in stressed than unstressed 
cases for both males and females, although for the ma- 
jority of cases, not statistically significantly so. Still, this 
suggested that stress would be expressed using a breath- 
ier voice quality. This contradicted the original research 
hypothesis predicting that stress would be expressed with 
a pressed voice quality. Therefore, AQ values were in- 
spected as well. As shown by the results, the AQ values 
behaved as expected, exhibiting smaller values in stressed 
vowels. ANOVA analyses showed this result to be statis- 
tically significant in the case of sentence stress, but not in 
the case of word stress. The authors suspect, however, that 
since the word stress appears to behave in a similar manner 
as sentence stress in the box plots, the lack of significance 
in ANOVA is only due to the small amount of material in 
the study. 

Figure 3: NAQ and AQ box-plots. In the labels, the letter 's' stands for a stressed case and 'u' for an unstressed one. The NAQ
values are higher in stressed than in unstressed cases for both males and females. The AQ values show a large difference between
males and females due to the intrinsically higher fundamental frequency of the females. Both in males and in females, AQ
exhibits lower values in stressed vowels than in unstressed ones.

There is plenty of evidence which shows that changing
the glottal function from breathy towards pressed in
sustained phonation is reflected by an increase of F0 and a
decrease of both the absolute and the relative length of the
glottal closing phase [e.g. 8]. In terms of NAQ and AQ,
this implies that changing the phonation type from breathy
to pressed in sustained phonation results in a decrease of
both parameters [2]. Interestingly, the current results
on the glottal function in continuous speech showed a
different trend, according to which AQ decreased in stressed
vowels in comparison to unstressed ones whereas the
values of NAQ increased. Hence, the initial hypothesis that
stress is expressed with a relatively more pressed voice
quality could not be supported in this study. This
unexpected, yet highly interesting result might be explained
by the behaviour of sub-glottal pressure. In sustained
phonation, a speaker is able to produce a long
vowel by using a steady-state value of the sub-glottal
pressure which, in parallel with glottal adductory forces
controlled by the cricothyroid and thyroarytenoid muscles,
results in the desired value of F0 and voice quality. In
continuous speech, however, the speaker has to adjust
continuously the function of the vocal apparatus in order to
produce different utterances including both voiced and
unvoiced sounds. This implies, importantly, that a sustained
sub-glottal pressure value cannot be held in the production of
vowels in continuous speech. However, the speaker is able
to change F0 by using the glottal adductory forces, and this
property can even be used to create fast changes in F0,
as evidenced by F0 contours computed from continuous
speech [e.g. 11]. With respect to the current results, the
authors argue that the change from unstressed to stressed
vowels caused a decreasing trend of AQ simply due to the
increase of F0, that is, the shortening of the entire length
of the glottal cycle. However, due to the lack of a sufficient
level of sub-glottal pressure, the shape of the glottal pulse
became smoother when its cycle length was reduced. In
other words, the speakers seemed to be unable to shorten
the length of the glottal closing phase as effectively as they
were able to shorten the length of the entire glottal
cycle. This, in turn, resulted in a breathier voice quality
indicated by the higher NAQ value.
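
A rough back-of-the-envelope check of this argument is possible from the reported group means, since NAQ = AQ / T0 implies T0 = AQ / NAQ. The sketch below applies this to the male sentence-stress means from Section III. Two caveats apply: the unit of AQ is not stated in the text (milliseconds is assumed here), and a ratio of group means is not the mean of the per-vowel ratios, so the numbers are indicative only.

```python
def implied_period(aq, naq):
    """Period T0 implied by NAQ = AQ / T0, in the same time unit as AQ."""
    return aq / naq


# Reported male means (AQ assumed to be in ms).
t0_unstressed = implied_period(aq=0.890, naq=0.098)   # about 9.1
t0_stressed = implied_period(aq=0.745, naq=0.109)     # about 6.8

# Relative changes from the unstressed to the stressed condition.
aq_change = 0.745 / 0.890 - 1.0
period_change = t0_stressed / t0_unstressed - 1.0
```

Under these assumptions the implied cycle shortens by roughly a quarter while AQ drops by about a sixth, so the closing phase shrinks less than the cycle as a whole and NAQ rises, which is the pattern the authors describe.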


Remarkably, when the references given by Swerts and
Veldhuis regarding the effect of F0 on OQ in inverse
filtered speech are inspected more carefully, OQ appears to
correlate positively with F0 or remain constant only when
samples of continuous speech are used. This supports the
findings of this study. The studies performed on sustained
vowels or artificial voicing tasks, on the other hand, are
more conflicting. These notions suggest, in the authors’
opinion, that the results acquired by the study of sustained
vowels should not be considered directly applicable to
continuous speech.


Clearly, more work is required to gather comprehen- 
sive data regarding the voice source behaviour in natural 
speech. Such research should concentrate on recordings of 
continuous speech and should apply robust inverse filtering 
methods and reliable glottal flow parameterization
methods. The authors believe that TKK Aparat, the freely
available glottal flow examination software used in this study,
provides tools suitable for further research on the topic.

Abstract: Some spectral transition features are
introduced and tested in samples from dysarthric 
patients. The goal is to explore their potential as 
descriptors of articulatory deviations. This 
preliminary analysis includes only stop consonants 
extracted from the diadochokinetic task. Results and 
discussion are detailed for each one of the dysarthric 
groups included in the experiment. 
Keywords: Articulation, dysarthria, spectral transitions


I. INTRODUCTION 


Dysprosody and articulation problems are evident in
each of the different types of dysarthria [1]. While a lot
has been done in order to find objective measures in the
domain of voice quality, such measures are largely lacking
in the domains of articulation and prosody. Objective measures
in speech face difficulties related to the assumptions
inherent in signal processing and to the variability of the
signal, which is amplified by the speech disorder; these
difficulties increase as more complex language units, from
phonemes to running speech, are analyzed [2].

Earlier research has shown that the Maximum Spectral
Transition positions are related to the perceptual critical
points that contain the most important information for
consonant and syllable perception [3][4]. Because the
spectral transition measure is applicable to voiced as well
as unvoiced segments, allowing the analysis of complex
units, it seems to be a suitable tool to explore articulation
in dysarthric speech. Furui [3] introduced the perceptually
essential interval as the minimal interval of the syllable
necessary to ensure that no perceptual degradation in
syllable identification is perceived compared with the
original syllable, and he proposed a spectral transition
measure that can be used to measure the essential interval,
as illustrated in Fig. 1. His results revealed that syllable
information for the consonants having a front constriction
is more concentrated in a short period than for the
consonants having a back constriction.
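
The formula of the spectral transition measure is not reproduced in this paper. The sketch below follows one common formulation, stated here as an assumption rather than as the exact definition used in [3]: for every frame, a straight line is fitted to each cepstral coefficient track over a short symmetric window, and the measure is the mean of the squared slopes. The window length and all names are illustrative.

```python
def spectral_transition(cepstra, half_window=2):
    """Mean squared regression slope of the cepstral tracks around each frame.

    cepstra     : list of frames, each a list of p cepstral coefficients
    half_window : frames on either side used in the line fit (assumed value)
    Returns one value per frame that has a full window; edge frames are skipped.
    """
    lags = range(-half_window, half_window + 1)
    norm = sum(m * m for m in lags)
    p = len(cepstra[0])
    measure = []
    for t in range(half_window, len(cepstra) - half_window):
        slopes = [sum(m * cepstra[t + m][i] for m in lags) / norm
                  for i in range(p)]
        measure.append(sum(s * s for s in slopes) / p)
    return measure


# A steady spectrum gives zero; linearly drifting coefficients give the
# mean of their squared slopes.
steady = spectral_transition([[1.0, -0.5]] * 7)
ramp = spectral_transition([[2.0 * t, -1.0 * t] for t in range(7)])
```

Peaks of such a curve mark the frames where the spectrum changes fastest, which is where the maximum spectral transition positions mentioned above would be located.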

This paper presents the results of a preliminary
analysis of the articulation of stop consonants by dysarthric
patients from the Mayo Clinic Database, under the
assumption that consonants perceived as distorted
will have shortened essential intervals or weak spectral
transitions.


II. METHODS 


The proposed features to characterize the strength and 
the duration of the spectral transitions are: 

• the essential interval, including its standard
deviation and kurtosis over a sequence of
pronunciations of the same syllable;

• the slope of the spectral transitions and the
slope’s standard deviation over the sequence;

• the areas associated to spectral transition
extrema.

Fig. 1 shows the proposed transition features for a 
given syllable. 
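
The sequence statistics named in the list can be sketched as follows. The text does not say which estimators were used, so the sample standard deviation (n - 1 in the denominator) and the non-excess kurtosis (fourth central moment over squared variance, equal to 3 for a normal distribution) are assumptions, and the interval durations in the example are made up.

```python
import math


def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def kurtosis(values):
    """Fourth central moment divided by the squared variance."""
    mean = sum(values) / len(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m4 = sum((v - mean) ** 4 for v in values) / len(values)
    return m4 / (m2 * m2)


# Hypothetical essential-interval durations (ms) over repeated syllables.
intervals = [42.0, 38.0, 55.0, 40.0, 41.0, 39.0]
e_std = sample_std(intervals)
e_kurt = kurtosis(intervals)
```

A single outlying repetition, such as the 55 ms interval here, raises both statistics, which is why they can serve as indicators of instability across repetitions.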




Fig. 1 Features extracted from the spectral transition 
interval for a CV syllable 


Speech samples of 58 dysarthric patients (native speakers
of English) were chosen from Aronson’s original
recordings of several types of dysarthria. To explore the
nature of the spectral transitions in the stop consonants,
the diadochokinetic task was selected from those
dysarthric groups that report imprecise or distorted
articulation, such as the Flaccid, Spastic, Ataxic,
Hypokinetic, and Hyperkinetic dysarthria types.

The analysis is carried out by comparing the features
extracted from the spectral transition measure proposed by
Furui [3] with the subjective (perceptual) evaluation
made by three experts on the articulation of the syllables
/pa/, /ta/ and /ka/. Correlation tables were constructed for
each group of patients and for each of the syllables, which
represent different articulator positions.
Linear regressions with the features that exhibit a better
correlation were made for the groups with common
results.
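
The two statistical tools named here can be sketched for a single feature. The example computes the Pearson correlation between a feature and the expert scores and fits a least-squares line; the function names and the numbers are illustrative and are not taken from the database.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept of y regressed on one feature x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


# Made-up data: transition slope per patient vs. mean severity judgment.
feature = [4.1, 3.5, 2.9, 2.2, 1.8]
judgment = [0.7, 1.0, 2.0, 2.3, 3.0]
r = pearson(feature, judgment)
slope, intercept = fit_line(feature, judgment)
```

A negative r, as in this toy data, corresponds to the situation in which the feature diminishes as the judged severity increases.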


III. RESULTS


A. Dysarthric-group correlations across syllables /pa/, /ta/ and /ka/


Flaccid and Spastic dysarthrias have better results in
the /pa/ syllable and still correlate in the other two
syllables because they are characterized by imprecise
bilabials, labiodentals, lingual dentals, lingual alveolars,
and weak pressure consonants. The Organic Voice
Tremor and Palato Pharyngeo Laryngeo Myoclonus
groups show consistent deviations in the
articulatory-related dimensions.

Table I shows the results of the analysis for sequences 
of utterances of the syllable /pa/. 


Table I. Feature correlations for the different dysarthric groups
uttering sequences of /pa/ syllables.

/pa/ Corr.   E       Estd    Kurt    Slope   Sstd    Areas   Astd
Flaccid      -0,13   -0,83   -0,91   -0,16    0,16   -0,13   -0,73
Ataxic       -0,06   -0,01    0,10    0,58    0,15   -0,11    0,11
Spastic      -0,48   -0,74    0,36   -0,79   -0,70   -0,48   -0,69
Chorea       -0,42    0,05   -0,36    0,34    0,36   -0,29   -0,03
Parkins.     -0,28    0,25    0,18    0,53   -0,13   -0,28   -0,16
O. Trem.     -0,20   -0,78    0,41   -0,65    0,67   -0,20   -0,65
Dystonia     -0,22   -0,09   -0,15    0,42    0,04   -0,35    0,60
ALS           0,39   -0,44    0,10    0,56   -0,19    0,39   -0,43
PPLM         -0,82   -0,92    0,77   -0,99   -0,93   -0,82   -0,92


For the Flaccid and Spastic dysarthrias and the Organic Voice
Tremor and Palato Pharyngeo Laryngeo Myoclonus
groups, the Essential-Interval Standard Deviation (Estd)
and the Slope yield an excellent linear regression
for /pa/, with statistics R² = 0.8764, F = 23.6300, p = 0.0001.
More detailed information can be found in [5].


Ataxic Dysarthria exhibits a moderate correlation with
the slope in /pa/ syllables and no significant
correlation with any of the measurements in /ta/
syllables, while it shows correlations with the slope and
the standard deviation of the areas in /ka/. This can be
related to the timing abnormalities that characterize this
disorder, which are more evident in consonants with
larger VOT, as reported by Duffy [1].


Chorea only exhibits a strong correlation with the slope
in /ta/ syllables. Parkinson samples correlate only in
the /ka/ syllables, with the standard deviation of the slope
and the Kurtosis. The latter is the only correlation
found for Dystonia. Irregular breakdowns, alternating
motion rates, and tremor produce different effects in the
imprecise articulation across the speakers; a correlation
can be found only when a larger number of articulators is
involved in the production of the sound.


Since ALS produces a mixed dysarthria (related to flaccid
and spastic dysarthria), there is no consistent tendency in
this group of patients. The tendency to increase the stop
gap duration in these speakers made it possible to
determine strong correlations with the essential interval
duration in /ta/ syllables and with the slope and its
standard deviation in /ka/ syllables. When the ALS patients
are pooled with the Flaccid and Spastic groups, the
Essential Interval kurtosis (Kurt) and the Areas yield an
excellent linear regression on the subjective evaluations.
Figure 2 shows the residuals plot for the linear
regression and Table II summarizes the mean values of
the expert judgments for the patients and the values for
the parameters Kurt and Areas. The regression statistics
are R² = 0.7638, F = 19.4012, p = 0.0002.
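
The reported F value can be cross-checked against R² with the usual relation for a linear regression with an intercept, F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below assumes n = 15 cases (the number of rows in Table II) and k = 2 predictors (Kurt and Areas); both counts are read off the surrounding text rather than stated explicitly as regression settings.

```python
def f_from_r2(r2, n_cases, n_predictors):
    """Overall F statistic of a linear regression with an intercept."""
    df_model = n_predictors
    df_resid = n_cases - n_predictors - 1
    f = (r2 / df_model) / ((1.0 - r2) / df_resid)
    return f, df_model, df_resid


f_ka, df1_ka, df2_ka = f_from_r2(0.7638, n_cases=15, n_predictors=2)
```

Under these assumptions the relation gives F(2, 12) of about 19.40, in agreement with the reported value up to the rounding of R².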


Fig. 2 Residual case order plot for the /ka/ syllables (horizontal axis: case number; vertical axis: residuals)


Table II. Patients for dysarthric groups with high correlations in
/ka/ syllables.

Patient (FD, SD, ALS)   Judgment   Areas   Kurt
 1                      2.666667    7.25   1.3098
 2                      2           8      3.2286
 3                      2           7      1.202
 4                      0.666667   11      1.5
 5                      1           8.5    2.5266
 6                      0.666667   11      3.5503
 7                      1.333333   11.5    2.0243
 8                      2           8      2.5288
 9                      3           7      1.7501
10                      2           8.75   1.9744
11                      3           4.5    1.5
12                      2           7      1.5
13                      3           5      1.5
14                      3           5.5    1.5
15                      2.333333    5.25   1.9554




The parameters found in the syllables with back
constriction offer more information in general than the
same set of parameters in the other articulatory positions.
This might be related to the duration of the syllable
transitions. Essential intervals for syllables with /k/ are
up to 20 ms longer than the others, according to Fig. 10 in
[3].


B. The most important parameters 


A summary of the most important parameters for each
of the three articulatory positions, together with the
typical description of the disorder for each dysarthric
group, is presented in Table III.


Table III. Best correlating parameters with the subjective 
evaluations for each type of dysarthria in /pa/, /ta/, /ka/ syllables 


Dysarthria Articulation /pa/ Ita/ /ka/ 
Flaccid Imprecise* Kurt El, Areas 
(multiple Imprecise (-0,91) (-0,84) 
cranial bilabials, 
nerves) labiodentals, 
n=5 lingual 
dentals, 
lingual 
alveolar, 
vowels, glides 
and liquids. 
Weak pressure 
consonants** 
Spastic Slow: Slope Slope El 
(Pseudo - imprecise* (-0,79) | (-0,90) (-0,89), 
bulbar) Sound Areas 
n=5 pressure level (-0,90) 
contrasts in Kurtosis 
consonants. (-0,78), 
Amplitude of Sstd 
release bursts (0,78) 
for stops. 
Duration of 
phoneme to 
phoneme 
transitions** 
Mixed Slow: EI Slope (- 
(ALS) imprecise* (0,76) 0,99), 
n=6 Reduced Sstd (- 
strength of lip; 0,80) 
tongue and 
jaw** 
Ataxic Irregular Astd 
n=8 breakdowns* (0,69), 
Imprecise Slope 
consonants** (-0,61) 
Hypokinetic Accelerated: Slope Kurtosis 
(Parkinsonism) | imprecise* (-0,57) (-0,83), 
n=8 Failure to Sstd 
completely (-0,80) 
reach the (n=5) 
articulatory 
targets or 
sustain 
contacts for 
sufficient
durations**
Hyperkinetic Fluctuating Astd Kurtosis
(Dystonia) distortions (0,60) (0,76) 
n=10 (slow)* (n=9) 
(Chorea) Abrupt, Slope 
n=8 intermittent (0,72)
distortions
(quick)*
(Organic Normal: Slope Slope Slope 
Tremor) secondary (-0,65) | (-0,72) (0,87) 
n=4 irregular , Sstd 

breakdowns* (0,67) 
(Palato — Normal, or Slope Sstd -1 (n=2) 
Pharyngeo — flaccid, (-0,99) (-0,99) 
Laryngeo spastic, or 
Myoclunus) ataxic 
n=3 dysarthria* 
*Dysarthria: Differential Diagnosis. Arnold E. Aronson. Types Volume 
1, Mentor Seminars 1993. 
** Motor Speech Disorders: Substrates, Differential Diagnosis, and 
Management. Joseph R. Duffy. Mosby 1995. 


IV. DISCUSSION 


According to the results summarized in Table III and
the linear regressions, our initial assumption is valid. The
transitions become shorter, weaker and less stable as the
severity score increases for the samples coming from the
Flaccid, Spastic and Mixed dysarthric groups. The
proposed features provide evidence of this in the following way:


1. The mean value of the slopes in the transitions of 
/pa/ and /ta/ syllables diminishes while the severity 
scores increase (Spastic, PPLM, OT, and Flaccid). 


2. The mean value of the essential intervals or of the
areas in the spectral transitions of /ka/ syllables
diminishes while the severity scores increase
(Flaccid, Spastic, and ALS).


3. The kurtosis of the essential intervals and the 
standard deviations in the transitions of /ka/ 
syllables are evidence of instabilities in the 
syllable repetitions beyond the normal speech 
variability (Flaccid, Spastic, Dystonia, Parkinson, 
and ALS). 


The literature describes articulatory difficulties in those
groups in terms of the stop consonants and the articulators
involved (Table III), which supports our results for the
selected sounds.


Those groups, whose articulatory deviations are 
described as irregular breakdowns, fluctuating, or 
intermittent distortions, lack the regularities observed in 
the groups mentioned above; in their case, we found only 
modest correlations between the subjective evaluations 
and the parameters.


V. CONCLUSION


A preliminary analysis of stop-consonant articulation
by dysarthric patients using spectral transition
measurements was presented. The results show that the
proposed features open up the possibility of measuring
imprecise articulation acoustically.

Taking into account that experts evaluate articulation 
mainly on running speech or on a reading task, a new 
algorithm must be developed to extract the features under 
those conditions for all kinds of consonants. 

Standards for healthy speakers for these features are
necessary to allow a detailed analysis and a clinical
interpretation. The features may then be included as part
of an expert system to give a severity ranking to
disordered speech.


Abstract: Our aim is to study the vocal expression of emotion
in real-life spoken interactions in order to build an emotion
detection system. We make use of a corpus of naturally-occurring
dialogs recorded in a real-life emergency medical
call center. The emergency context yields a large palette of
complex and mixed emotions. About 30% of the utterances
in this medical corpus are annotated with non-neutral
emotion labels. The complexity of the emotion recognition
task increases with the number of classes and with how
fine-grained and close these classes are. Finding relevant
features of various types, such as speech disfluencies or
affect bursts, becomes essential in order to improve the
detection performance. Our experiments focus on a task of
discriminating 2 to 5 emotions among Fear, Anger, Sadness,
Neutral and Relief.


Keywords: Emotions, real-life spoken interactions, 
detection system, medical call center. 


I. INTRODUCTION 


This decade has seen an upsurge of interest in affective 
computing. Speech and Language are among the main channels 
to communicate human affective states. Affective Speech and 
language processing can be used alone or coupled with other 
multimodal channels in many systems such as call centers, 
robots, artificial animated agents for telephony, education, 
medical or games applications. Affective corpora are then 
fundamental both to developing sound conceptual analyses and 
to training these 'affective-oriented systems' at all levels - to 
recognise user affect, to express appropriate affective states, to 
anticipate how a user in one state might respond to a possible 
kind of reaction from the machine, etc. Our aim is to study the
vocal expression of “emotion” in real-life spoken interactions
in order to build an emotion detection system.


In the computer science community, the widely used terms of 
emotion or emotional state are used without distinction from the 
more generic term affective state which may be viewed as more 
adequate from the psychological theory point of view. This 
“affective state” includes the emotions / feelings / attitudes / 
moods / and the interpersonal stances of a person. There is a 
significant gap between the affective states observed with 
artificial data (acted data or contrived data produced in 
laboratories) and those observed with real-life spontaneous data. 
Most of the time, research is carried out on a subset of the big-six
“basic” emotions described by Ekman [1] and on prototypical
acted data. In artificial data, the context is “rubbed out” or
“manipulated”, so we can expect much simpler full-blown
affective states, which are quite far from spontaneous
affective states. The affective state of a person at any given time
is a mixture of emotion / attitude / mood / interpersonal stance,
often with multiple trigger events (internal or external) occurring
at different times: for instance, a physical internal event such as a
stomach-ache triggering pain, together with an external event such as
“someone helping the sick person” triggering relief. Thus, far
from being as simple as a “basic emotion”, affective states in
spontaneous data are a subtle blend of many more complex and
often seemingly contradictory factors that are very relevant to
human communication and that are perceived without any
conscious effort by any native speaker of the language or
member of the same cultural group.


The first challenge when studying real-life speech data is to find 
the set of appropriate descriptors attributed to an emotional 
behaviour. For a recent review of all emotion representation 
theories, the reader is referred to the Humaine NoE 
(www.emotion-research.net). Several studies define emotions 
using continuous abstract dimensions: Activation-Valence or 
Arousal-Valence-Power. But these dimensions do not
always yield a precise representation of emotion. For
example, they do not make it possible to distinguish fear from anger.
According to appraisal theory [2], the perception and the
cognitive evaluation of an event determine the type of
emotion felt by a person. Finally, the most widely used approach
for the annotation of emotion is the discrete representation of
emotion using verbal labels, which makes it possible to discriminate
between different emotion categories. In the context of
Humaine, we have defined an annotation scheme, the “Multi-level
Emotion and Context Annotation Scheme” [3, 4], to represent
complex real-life emotions in audio and audiovisual natural data.
This scheme is adapted to each task. We are also involved
as experts in the W3C incubator group on emotion
representation.


The second challenge is to identify relevant cues that can be 
attributed to an emotional behaviour and separate them from 
those that are simply characteristic of spontaneous 
conversational speech. A large number of linguistic and 
paralinguistic features indicating emotional states are present in 
the speech signal. The aim is to extract the main voice
characteristics of emotions, together with their deviations, which
are often present in real spontaneous interaction. Among the
features mentioned in the literature as relevant for characterizing 
the manifestations of speech emotions, prosodic features are the 
most widely employed, because as mentioned above, the first 
studies on emotion detection were carried out with acted speech 
where the linguistic content was controlled. At the acoustic 
level, the different features which have been proposed are 
prosodic (fundamental frequency, duration, energy), and voice- 
quality features [5]. Additionally, lexical and dialogic cues can 
help as well to distinguish between emotion classes [3, 7, 8, 9]. 
The most widely used strategy is to compute as many features as
possible. All the features are, more or less, correlated with each
other. Optimization algorithms are then often applied to select
the most efficient features and reduce their number, thereby
avoiding hard a priori decisions about the relevant
features. Combining information of different natures,
paralinguistic features (prosodic, spectral, etc.) with linguistic
features (lexical, dialogic), to improve emotion detection or
prediction is also a research challenge. Due to the difficulty of
categorization and annotation, most studies have
focused on only a minimal set of emotions.


In this study, we show that by using a large number of different
features, we can improve on the performance obtained with only
classical prosodic features. Section 2 describes the corpus of
real-life data. Section 3 is devoted to the description of the
features used. In section 4, the methods for training models are
briefly described. Section 5 summarizes our results, which are
then discussed.


II. REAL-LIFE DATA 


In the context of emergency, emotions are not acted but
genuinely felt. The aim of the medical call center service
is to offer medical advice. The agent follows a precise,
predefined strategy during the interaction to efficiently acquire
important information. The role of the agent is to determine the
call topic and the caller location, and to obtain sufficient details
about the situation so as to be able to evaluate the urgency of
the call and to take a decision. In the case of emergency
calls, the patients often express stress, pain, fear of being sick or 
even real panic. In many cases, two or three persons speak 
during a conversation. The caller may be the patient or a third 
person (a family member, friend, colleague, caregiver, etc.). 


The corpus (Table 1) contains 688 agent-client dialogs totalling
around 20 hours (271 males, 513 females). The corpus has been
transcribed following the LDC transcription guidelines.


Table 1. Corpus Description


#agents           7 (3M, 4F)
#clients          688 dialogs (271M, 513F)
#turns/dialog     Average: 48
#distinct words   9.2 k
#total words      262 k


Some additional markers (Table 2) have been added to denote
named entities, breath, silence, unintelligible speech, laughter,
tears, throat clearing and other noises (mouth noise).


Table 2. Number of main non-speech sound markers in
20 hours of spontaneous speech.


#laugh         119
#tear          182
# « heu »      7347
#mouth noise   4500
#breath        243


The use of these data carefully respected ethical conventions
and agreements ensuring the anonymity of the callers, the
privacy of personal information and the non-diffusion of the
corpus and annotations.


In our experiment, we define one list of emotion labels using a
majority voting technique. A first list of labels was selected from
the fusion of several lists of emotional labels defined within
HUMAINE (European network on emotion, http://emotion-research.net/).
In a second step, several judges rated each
emotion word of this list with respect to how relevant it
sounded for describing the emotions present in our corpus.


We have defined an annotation scheme “Multi-level Emotion 
and Context Annotation Scheme” [3, 4] to represent the 
complex real-life emotions in audio and audiovisual natural 
data. It is a hierarchical framework allowing emotion 
representation at several layers of granularity (Table 3), with 
both dominant (Major) and secondary (Minor) labels and also 
the context representation. This scheme includes verbal (from 
the predefined list), dimensional and appraisal labels. 
Representing complex real-life emotion and computing
inter-labeler agreement and annotation label confidences are
important issues to address. A soft emotion vector is used to
combine the decisions of the several coders and represent
emotion mixtures [3, 4]. This representation makes it possible to
obtain a much more reliable and rich annotation and to select the
part of the corpus without blended emotions for training models.
Sets of “pure” emotions or blended emotions can then be used for
testing models. About 30% of the utterances in this medical corpus
are annotated with non-neutral emotion labels (Table 4).
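
To illustrate how a soft emotion vector can combine the Major and Minor labels of several coders, here is a minimal sketch. The weights (2 for a Major label, 1 for a Minor label), the function name and the example annotations are assumptions made for illustration; the actual weighting is described in [3, 4] and is not reproduced here.

```python
from collections import Counter


def soft_emotion_vector(annotations, w_major=2.0, w_minor=1.0):
    """Combine per-coder (major, minor) labels into a normalised vector.

    annotations: list of (major_label, minor_label_or_None), one per coder.
    The weights are hypothetical and only serve the illustration.
    """
    scores = Counter()
    for major, minor in annotations:
        scores[major] += w_major
        if minor is not None:
            scores[minor] += w_minor
    total = sum(scores.values())
    return {label: weight / total for label, weight in scores.items()}


# Two coders: the first hears Fear with some Anger, the second only Fear.
vector = soft_emotion_vector([("Fear", "Anger"), ("Fear", None)])
print(vector)  # {'Fear': 0.8, 'Anger': 0.2}
```

A segment whose vector has a single non-zero component can be treated as a "pure" emotion; any other segment is a blend.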


Table 3. Emotion classes hierarchy: multi-level of
granularity


Coarse level     Fine-grained level
(8 classes)      (20 classes + Neutral)
Fear             Fear, Anxiety, Stress, Panic, Embarrassment
Anger            Annoyance, Impatience, ColdAnger, HotAnger
Sadness          Sadness, Dismay, Disappointment, Resignation, Despair
Hurt             Hurt
Surprise         Surprise
Relief           Relief
Interest         Interest, Compassion
Other Positive   Amusement


Table 4. Distribution of fine labels (688 dialogues). Other gives
the percentage of the 15 other labels. Neu: Neutral, Anx:
Anxiety, Str: Stress, Rel: Relief, Hur: Hurt, Int: Interest,
Com: Compassion, Ann: Annoyance, Sur: Surprise, Oth: Other.


Caller   Neu.    Anx.    Str.   Rel.   Hur.   Oth.
10810    67.6%   17.7%   6.5%   2.7%   1.1%   4.5%

Agent    Neu.    Int.    Com.   Ann.   Sur.   Oth.
11207    89.2%   6.1%    1.9%   1.7%   0.6%   0.6%




The Kappa coefficient (measuring the inter-labeler agreement)
was computed for agents (0.35) and callers (0.57). The
following experiments have been carried out on the callers'
voices for the coarse classes: Fear, Anger, Sadness, Relief and a
“Neutral” state.
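
For readers unfamiliar with the Kappa coefficient, the sketch below computes Cohen's kappa for two labelers. This is an assumption: the text does not state which kappa variant or how many labelers were used, and the label sequences are invented toy data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two labelers, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability that both labelers pick the same class
    # if each one labels at random with their own class frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical), six segments labelled by two coders.
coder_1 = ["Fear", "Fear", "Neutral", "Neutral", "Anger", "Neutral"]
coder_2 = ["Fear", "Neutral", "Neutral", "Neutral", "Anger", "Fear"]
kappa = cohen_kappa(coder_1, coder_2)
print(round(kappa, 3))  # 0.455
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which helps put the reported values of 0.35 and 0.57 in perspective.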


III. FEATURES 


Prosodic features (mainly F0 and Energy) are classical features
used in a majority of experiments on emotion detection. For
accurate emotion detection in natural real-world speech dialogs,
prosodic information is not the only information that must be
considered.


We use non-verbal speech cues such as speech disfluencies and 
affect bursts (laugh, tear, etc.) as relevant cues for emotion 
characterization. For example, we considered the autonomous
main French filler pause "euh" as a marker of disfluency. It
occurs as an independent item and has to be differentiated from
vocalic lengthening. We correlated the filler pause with emotions
in [10]. This correlation follows the orthographic (lexical) 
transcription of the dialogs and considers the number of 
occurrences of transcribed "euh" per emotion class. In [10], 
"euh" was correlated mainly with Fear sentences, followed by 
Anger sentences and finally the other emotions. In [11], affect 
bursts such as laughter or mouth noise are shown to be also 
helpful for emotion detection. 


Since there is no common agreement on a top list of features and 
the feature choice seems to be data-dependent, our usual 
strategy is to use as many features as possible even if many of 
the features are redundant, and to optimize the choice of features 
with attribute selection algorithms. In the experiments reported 
in this paper, we divided the features into several types with a 
distinction between those that can be extracted automatically 
without any human intervention (prosodic, spectral features, 
microprosody) and the others (duration features after automatic 
phonemic alignment, features extracted from transcription 
including disfluencies and affect bursts). 


Our set of features includes very local cues (such as, for instance,
the local maxima or inspiration markers) as well as global
cues (computed on a segmental unit) [12]. In Table 5, we
summarize the different types of features and the number of 
cues used in our experiments. 


We distinguish the following sets of features: 


- “Blind”: automatic features extracted only from audio signal 
including paralinguistic features (prosodic, micro-prosodic, 
formants) 


The Praat program [13] was used for the extraction of prosodic
(F0 and energy), microprosody and spectral cues. It is based on
a robust algorithm for periodicity detection carried out in the lag
auto-correlation domain. Since F0 detection is subject to
errors, a filter was used to eliminate some of the extreme values
that are detected. Energy, spectral cues and formants were only
extracted on voiced parts (i.e. parts where Praat detects F0). The
paralinguistic features were normalized using Z-norm:
zNorm(P) = (P-mean(P))/std(P). The aim is to erase speaker
differences without smoothing variations due to emotional
speech.


- “Trans1”: duration features from phonemic alignment


For the moment we only extracted duration features from the
phonetic transcription: mean and maximum phone duration,
phonemic speech rate (#phones / turn length), and length (max and
mean) of hesitations.


- “Trans2”: features extracted from the transcription


Non-linguistic event features: inspiration, expiration, mouth
noise, laughter, crying, number of truncated words and
unintelligible voice. These features are marked during the
transcription phase.
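
Two of the computations described above, the Z-norm of a paralinguistic parameter and the duration features derived from the phonemic alignment, can be sketched as follows. The function names, the use of the population standard deviation and the toy values are assumptions made for illustration, not the authors' implementation.

```python
import statistics


def z_norm(values):
    """zNorm(P) = (P - mean(P)) / std(P), e.g. over one speaker's F0 values.

    Assumption: the population standard deviation is used.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def duration_features(phone_durations, turn_length):
    """Duration cues for one turn (durations and turn length in seconds)."""
    return {
        "mean_phone_duration": statistics.mean(phone_durations),
        "max_phone_duration": max(phone_durations),
        # Phonemic speech rate: number of phones divided by turn length.
        "speech_rate": len(phone_durations) / turn_length,
    }


# Hypothetical toy values: three F0 measurements (Hz) and three phones.
f0_normalised = z_norm([100.0, 120.0, 140.0])
features = duration_features([0.08, 0.12, 0.10], turn_length=0.5)
print(f0_normalised, features["speech_rate"])
```

Normalising per speaker removes differences in average pitch or loudness between callers while keeping each caller's own variation, which is the part expected to carry the emotional information.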


Table 5: Summary of the feature types


Feature set   Feature type                                 Number of cues
Blind         F0 related
              Energy
              Microprosody                                 14
              Spectral & formant related                   30
                Formants
                Bandwidths                                 18
Trans1        Duration features from phonemic alignment    11
Trans2        Speech disfluencies and affect bursts
              from transcription                           11


IV. METHODS


The set of features described in section 3 is computed for all
emotion segments in order to compare the performances that can
be achieved using one type only and to study the gain obtained
by mixing them. We have therefore focused on the
performances that can be obtained using prosodic, spectral,
disfluency and non-verbal event cues. The same training and test
sets are used for all experiments. Several studies have shown the
Support Vector Machine [14] (search for an optimal hyperplane to
separate the data) to be an effective classifier for emotion
detection. An SVM classifier with a Gaussian kernel was therefore
used for all experiments with the Weka software [15]. Because SVMs
are two-class classifiers, the multi-class classification is solved
using pairwise classification. Detection results are given using
the CL score (class-wise averaged recognition, i.e. the average of
the diagonal of the confusion matrix).


V. RESULTS AND DISCUSSION


With only blind features and without any knowledge about the
speech transcription, we obtained a detection rate of 45% on
these 5 emotions. Still, the more emotional classes there are, the
more different cues will be needed to achieve good detection
rates. By adding knowledge (Fig. 1) derived from the
orthographic transcription (disfluencies, affect bursts, phonemic
alignment) and after selecting the best 25 features, we
achieved a detection rate of 56% for the same 5 emotions.
Features from all the types were selected among the 25 features:
15 features in the Blind set, 4 in Trans1 and 6 in Trans2.


Figure 1: CL score for the 5 classes Fear, Anger,
Sadness, Relief and Neutral with different sets of cues; Blind:
all parameters extracted automatically (F0, Formants,
Energy, microprosody); Trans1: durations from phonemic
alignment; Trans2: parameters extracted from the manual
transcription; all: everything; 25-best: 25 best features
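
The detection rates reported here are CL scores, i.e. class-wise averaged recognition rates. A minimal sketch of that measure, computed from a confusion matrix of raw counts, is given below; the function name and the toy matrix are hypothetical.

```python
def cl_score(confusion):
    """Class-wise averaged recognition rate.

    confusion[i][j] is the number of test segments of class i that were
    labelled as class j. Each row is normalised before averaging, so every
    class counts equally regardless of how many segments it has.
    """
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total > 0:
            recalls.append(row[i] / total)
    return sum(recalls) / len(recalls)


# Hypothetical two-class example with unbalanced classes.
confusion = [[8, 2],     # 10 segments of class 0, 8 recognised
             [10, 30]]   # 40 segments of class 1, 30 recognised
score = cl_score(confusion)
print(score)
```

In this example the CL score is the mean of 0.80 and 0.75, whereas the plain accuracy would be 38/50; the two measures differ whenever the classes are unbalanced, as is the case in this corpus where Neutral dominates.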


The experiments described in Fig. 2 focus on a task of 
discriminating 2 to 5 emotions among Fear, Anger, Sadness, 
Neutral and Relief. 


Figure 2: Performances from 2 emotions to 5 emotions (Fe:
Fear, N: Neutral state, Ag: Anger, Sd: Sadness, Re: Relief).
Emotion sets shown: Fe/N, Fe/Sd, Ag/N, Ax/St, Fe/Ag, Sd/N,
Fe/Ag/N, Fe/Sd/N, Fe/Ag/Sd/Re, Fe/Ag/Sd/Re/N.


The complexity of the recognition task increases with the
number of classes and with how fine-grained and close these
classes are. For only two emotions (such as Anger/Neutral or
Fear/Neutral), we obtained a detection rate of more than 80%
with our best system. In conclusion, finding relevant features of
various types becomes essential in order to improve emotion
detection performance on real-life spontaneous data. Some of
these features, such as affect bursts or disfluencies, could be
detected automatically without any speech recognition. Future
experiments will be devoted to the automatic detection of such
features.


Abstract: The objective is to describe analysis methods
that enable tracking vocal dysperiodicities in running 
speech. Vocal dysperiodicities here refer to deviations 
from strict periodicity in voiced speech sounds. Two 
methods are described. They respectively enable the 
sample-by-sample extraction of vocal noise from the 
speech signal or the isolation of speech cycles in voiced 
segments to quantify perturbations of the cycle 
lengths and amplitudes (i.e. cycle duration jitter and 
amplitude shimmer). These methods share the 
property that they are not based on the assumption 
that the signal is locally periodic and that the average 
period length is known a priori. 

Keywords: Vocal noise, jitter, running speech 


I. INTRODUCTION 


The objective of the presentation is to describe analysis 
methods that enable tracking vocal dysperiodicities in 
running speech. Vocal dysperiodicities here refer to 
deviations from strict periodicity in voiced speech 
sounds. The description of vocal dysperiodicities is a 
common practice in the framework of the clinical 
assessment of vocal function. 

Acoustic descriptors of vocal dysperiodicity are 
temporal or spectral. Frequently, they are extracted from 
sustained speech sounds. Privileging steady sounds when 
analyzing vocal disturbances is a matter of technical 
feasibility rather than clinical relevance. It is indeed the 
case that existing clinical voice analysis software is able 
to deal with sustained sounds only or is known to fail on 
speech produced by severely hoarse speakers. The reason 
is that many analysis methods are based on the hypothesis 
that the analyzed sounds are locally periodic. This is an 
assumption that is not valid under all circumstances, 
however [1]. 

Therefore, we have developed methods that enable 
estimating vocal dysperiodicities in speech that is not 
steady and that may be produced by severely hoarse 
speakers. The methods that are described make possible 
the sample-by-sample extraction of vocal noise from the 
speech signal, as well as the isolation of speech cycles in 
voiced segments to quantify perturbations of the cycle 
lengths and amplitudes (i.e. cycle duration jitter and 
amplitude shimmer). Descriptors of vocal jitter and
shimmer differ from descriptors of vocal noise in general
insofar as they focus on modulation noise exclusively.

Generally speaking, the description of vocal jitter and
shimmer is regarded as meaningful only when the
speech segments are pseudo-periodic. At this stage, it is
not clear whether these limitations are the consequence of
a lack of reliability of existing signal analysis methods or
a lack of validity of the extracted vocal cues.

Two methods are described. They share the property 
that they are not based on the assumption that the signal 
is locally periodic and that the average period length is 
known a priori. The first method enables tracking noise 
(whatever the cause) in any speech sound produced by 
any speaker. 

The second method consists in a multi-resolution 
analysis of the signal samples in terms of their salience. 
Sample salience designates the duration over which a 
signal sample is a maximum. Salience is a relevant signal 
feature because one observes that signal peaks that are 
similarly positioned in vocal cycles may have similar 
saliences even if the peak amplitudes differ widely. This 
also applies to peaks in cycles the durations of which are 
perturbed moderately. The salience of signal peaks can 
therefore be used to detect automatically voiced speech 
cycles because they display a preeminent peak in the 
vicinity of glottal closure. 


II. METHODS 
A. Extraction of vocal dysperiodicities 


The method is based on the observation that when, in a
2-dimensional graph, one plots on the horizontal axis
samples of a noise-free periodic signal and on the vertical
axis samples that are identically positioned in an adjacent
period, then all sample pairs (x,y) are located on the
bisector of the graph.

In a noisy signal, pairs (x,y) remain in the vicinity of 
the bisector, as shown in Fig.1. The cumulated distance 
between pairs and bisector over an analysis frame is a 
measure of the total signal noise in that frame and the 
individual distances between each pair and the bisector 
are sample-by-sample estimates of the noise (whatever its 
cause). 

In practice, a sliding rectangular analysis window of
2.5 ms is used and auxiliary windows are time-shifted to
the left and right to minimize the cumulated distance of
all sample pairs to the bisector. Positioning analysis
frames to both the left and the right of the main analysis
window avoids comparing signal fragments that do not
belong to the same phonetic segment, because the
minimum distance is retained as a measure of vocal noise
[2, 6].

Before the calculation of the individual and cumulated 
distances, the within-window signal fragments are 
energy-normalized and their averages are removed. 
Energy-normalization enables compensating slow 
amplitude variations and average-normalization enables 
removing offsets. Without energy- and average- 
normalization sample pairs would be aligned on a straight 
line with a slope different from one and displaced from 
the origin. 

An algebraic formulation of the procedure outlined
above shows that it is equivalent to the calculation of the
variogram of the speech signal involving a current and
left- and right-positioned analysis frames. The variogram
is minimal for the shift of the auxiliary analysis window
that minimizes the cumulated distances to the bisector
[5].

To obtain vocal dysperiodicity estimates over a 
complete signal, the main window is shifted without 
overlap or gap and the variogram analysis is repeated as 
often as necessary. 
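
The procedure described in this section can be sketched for a single analysis frame as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: squared sample differences stand in for the distance to the bisector, the auxiliary frame is scaled so that its energy equals that of the current frame, and the lag range (`min_lag`, `max_lag`, in samples) and all function names are hypothetical.

```python
import math


def _centered(frame):
    """Remove the average of a frame (offset removal)."""
    mean = sum(frame) / len(frame)
    return [v - mean for v in frame]


def frame_dysperiodicity(signal, start, length, min_lag, max_lag):
    """Return (cost, shift, residual) for the best auxiliary frame.

    The auxiliary frame is searched to the left and to the right of the
    current frame; the shift giving the smallest cumulated squared
    difference is retained.
    """
    current = _centered(signal[start:start + length])
    energy_cur = sum(v * v for v in current)
    best = None
    for lag in range(min_lag, max_lag + 1):
        for shift in (-lag, lag):
            a = start + shift
            if a < 0 or a + length > len(signal):
                continue  # auxiliary frame would fall outside the signal
            aux = _centered(signal[a:a + length])
            energy_aux = sum(v * v for v in aux)
            if energy_aux == 0.0:
                continue
            gain = math.sqrt(energy_cur / energy_aux)  # energy equalisation
            residual = [c - gain * v for c, v in zip(current, aux)]
            cost = sum(r * r for r in residual)
            if best is None or cost < best[0]:
                best = (cost, shift, residual)
    return best


# Noise-free periodic test signal with a period of 50 samples.
signal = [math.sin(2.0 * math.pi * n / 50.0) for n in range(400)]
cost, shift, residual = frame_dysperiodicity(signal, 100, 20, 30, 80)
print(abs(shift), cost < 1e-12)  # 50 True
```

For a strictly periodic signal the residual vanishes at a shift of one period; for real speech, the residual of the retained shift is the sample-by-sample dysperiodicity estimate of that frame.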


Figure 1: Auxiliary versus main window samples for
one frame of vowel [a] in sentence S1 produced by a
female normophonic speaker (horizontal axis: current
frame; vertical axis: auxiliary frame).


B. Global and segmental signal-to-dysperiodicity 
ratios 


The vocal noise is summarized by means of global and 
segmental signal-to-dysperiodicity ratios (1), 
dysperiodicity e(n) being the distance of a sample pair to 
the bisector. 


$$\mathrm{SDR} = 10 \log_{10} \frac{\sum_{n=0}^{L-1} x^2(n)}{\sum_{n=0}^{L-1} e^2(n)} \qquad (1)$$


The global ratio involves the log-ratio of the signal and
dysperiodicity energies over the whole signal duration. 
The segmental ratio involves the average of the log-ratio 
(1) computed for analysis segments of 5 ms. The latter is 
frequently used to summarize signal degradation owing to 
lossy coding. The reason is that it is expected to correlate 
better with perceived loss of signal quality than the global 
log-ratio [4]. 
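
A minimal sketch of the two ratios is given below, assuming that the signal and the dysperiodicity e(n) are available as equally long sample lists. The function names, the sampling rate argument and the handling of silent segments (they are skipped) are assumptions made for illustration.

```python
import math


def sdr_db(signal, dysperiodicity):
    """Log-ratio (1) of signal energy to dysperiodicity energy, in dB."""
    num = sum(v * v for v in signal)
    den = sum(v * v for v in dysperiodicity)
    return 10.0 * math.log10(num / den)


def segmental_sdr_db(signal, dysperiodicity, sample_rate, segment_s=0.005):
    """Average of the log-ratio computed over consecutive 5 ms segments."""
    seg_len = max(1, int(round(segment_s * sample_rate)))
    ratios = []
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        s = signal[start:start + seg_len]
        e = dysperiodicity[start:start + seg_len]
        if any(s) and any(e):  # skip segments with zero energy
            ratios.append(sdr_db(s, e))
    return sum(ratios) / len(ratios)


# Toy example: the dysperiodicity is 10 times smaller than the signal,
# so both ratios are close to 20 dB.
x = [1.0, -1.0] * 100
e = [0.1, -0.1] * 100
global_sdr = sdr_db(x, e)
segmental_sdr = segmental_sdr_db(x, e, sample_rate=8000)
print(round(global_sdr, 3), round(segmental_sdr, 3))
```

Because the segmental ratio averages log-ratios, a short high-energy disturbance affects only the few segments in which it occurs, whereas it can dominate the energies summed in the global ratio.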


C. Computation of the speech sample salience 


The sample salience is defined as the longest interval 
over which a sample is a maximum. The estimation of the 
salience consists in considering all possible within-array 
analysis intervals and noting how often a sample is a 
maximum within each. Boundary effects are taken into 
account by rotating N times the samples within an array 
of length N so that each sample occupies once the left and 
right boundary positions. 


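Table 1: Illustration of a multi-resolution salience
computation of an array (the first nine samples of the top
line, followed by a copy of the array).


Array:    1 2 0 4 3 6 1 2 1 1 2 0 4 3 6 1 2 1

Pos. 1:   1 3 1 5 1 9 1 3 1
Pos. 2:     2 1 4 1 9 1 4 2 1
Pos. 3:       1 3 1 9 1 4 2 1 3
Pos. 4:         2 1 9 1 4 1 1 4 1
Pos. 5:           1 9 1 3 2 1 4 1 4
Pos. 6:             9 1 2 1 1 3 1 4 1
Pos. 7:               1 6 2 1 3 1 8 1 9
Pos. 8:                 5 1 1 2 1 7 1 9 1
Pos. 9:                   2 1 4 1 6 1 9 1 2

Average:  1 3.1 1 4.8 1 9 1 3.7 1.6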


The calculation of the sample salience involves the 
following steps. The handling of boundary effects is 
discussed later. 


1. Initialization of all sample saliences to one.


2. Division of the array length N into analysis intervals
of length 2. The rightmost interval stops at the
rightmost array boundary whatever its length (i.e. 1 or
2).


3. Determination of the maximum within each interval.


4. Assignment of a salience of 2 to the interval maxima.


5. Increase of the interval length by one.


6. Division of the array length N into analysis intervals
of length n. The length of the rightmost analysis
interval is comprised between 1 and n.


7. Determination of the interval maxima.


8. Assignment of salience n to each interval maximum.


9. Looping back to step 5.


10. Stop when the analysis interval length equals N.


The position of the analysis array within the signal may 
be arbitrary and the saliences of the samples in the 
rightmost interval are affected by the anomalous interval 
lengths. To obtain sample saliences that are less 
dependent on position, the N samples in the analysis 
array are rotated N times so that each sample is 
positioned once at the right and left boundaries, and the 
sample salience is calculated for each within-array 
rotation. The final sample salience is the average of the 
saliences computed for each rotation. 

In practice, rotation is carried out by copying the 
analysis array to the right and shifting the array stepwise 
from left to right N times. Tab.1 illustrates obtaining the 
sample salience for an array of length 9. Each line in 
Tab.1 gives the sample saliences for one array position. 
The last line gives the final average saliences, which are 
considered to be independent of the sample positions with 
regard to the array boundaries. 
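
The salience computation described in this section can be sketched as follows. Two details are assumptions of this sketch rather than statements of the text: a shortened rightmost interval contributes its actual length, and when several samples tie for the maximum of an interval the leftmost one is retained. The function names are hypothetical. With these choices the sketch reproduces the average saliences of the nine-sample example of Tab.1.

```python
def salience_one_position(x):
    """Saliences of the samples of x for one array position (no rotation)."""
    n_samples = len(x)
    sal = [1] * n_samples
    for n in range(2, n_samples + 1):          # interval lengths 2..N
        for start in range(0, n_samples, n):   # consecutive intervals
            stop = min(start + n, n_samples)   # rightmost may be shorter
            # Leftmost maximum of the interval (tie-breaking assumption).
            peak = max(range(start, stop), key=lambda i: x[i])
            # Keep the longest interval over which the sample is a maximum.
            sal[peak] = max(sal[peak], stop - start)
    return sal


def salience(x):
    """Average salience of each sample over the N rotations of the array."""
    n_samples = len(x)
    totals = [0.0] * n_samples
    for r in range(n_samples):
        rotated = x[r:] + x[:r]
        for pos, s in enumerate(salience_one_position(rotated)):
            totals[(pos + r) % n_samples] += s  # map back to original index
    return [t / n_samples for t in totals]


array = [1, 2, 0, 4, 3, 6, 1, 2, 1]
print(salience_one_position(array))
# [1, 3, 1, 5, 1, 9, 1, 3, 1]
print([round(v, 1) for v in salience(array)])
# [1.0, 3.1, 1.0, 4.8, 1.0, 9.0, 1.0, 3.7, 1.6]
```

The global maximum (6) keeps a salience of 9 in every rotation, while the saliences of the other local peaks depend on where the array boundaries fall, which is why the rotations are averaged.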


D. Extraction of the vocal cycle lengths and 
amplitudes 


Preprocessing: The speech signal is low-pass filtered 
to remove additive noise as well as high-frequency 
formants. A zero phase filter is used to prevent phase 
distortion. The cut-off frequency is 900Hz. 


Multi-resolution analysis: The cycle positions are
determined on the basis of the main cycle peaks that occur
in the vicinity of glottal closure. These are extracted by
computing the salience of each signal sample and
discarding those samples that are not peaks.

The main cycle peak sequence is extracted by taking 
into account the peaks one by one in the order of 
decreasing salience. For each peak sequence the 
coefficient of variation of the inter-peak durations is 
computed. The peak sequence giving rise to a minimal 
coefficient of variation is retained. The search interval for 
the minimum is fixed by the frequency band 50Hz to 
400Hz in which the average vocal frequency is expected. 


Salience analysis is performed twice, once for each 
polarity of the signal and the polarity giving the smallest 
coefficient of variation is retained. 
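
The selection of the main cycle peak sequence can be sketched as follows, assuming the peaks and their saliences have already been computed. The function name, the minimum of three peaks per sequence and the preference for the longer sequence when two sequences have the same coefficient of variation are assumptions of this sketch; polarity handling is left out.

```python
import statistics


def select_cycle_peaks(peaks, sample_rate, f_min=50.0, f_max=400.0):
    """Pick the peak sequence with the most regular inter-peak durations.

    peaks: list of (position_in_samples, salience).
    Peaks are added one by one in order of decreasing salience; for each
    sequence the coefficient of variation (CV) of the inter-peak durations
    is computed, and only sequences whose average frequency lies between
    f_min and f_max are considered.
    """
    ranked = sorted(peaks, key=lambda p: -p[1])
    best_cv, best_seq = None, []
    for k in range(3, len(ranked) + 1):
        seq = sorted(pos for pos, _ in ranked[:k])
        intervals = [b - a for a, b in zip(seq, seq[1:])]
        mean = statistics.mean(intervals)
        if not (f_min <= sample_rate / mean <= f_max):
            continue
        cv = statistics.pstdev(intervals) / mean
        if best_cv is None or cv <= best_cv:
            best_cv, best_seq = cv, seq
    return best_seq, best_cv


# Synthetic example: main peaks every 100 samples (100 Hz at 10 kHz)
# and weaker secondary peaks 30 samples after each main peak.
main = [(i * 100, 50.0) for i in range(10)]
minor = [(i * 100 + 30, 5.0) for i in range(10)]
sequence, cv = select_cycle_peaks(main + minor, sample_rate=10000)
print(sequence, cv)
```

In the synthetic example the secondary peaks make the inter-peak durations alternate between short and long values, so the sequence restricted to the ten most salient peaks gives the smallest coefficient of variation and is the one retained.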


E. Corpus 


The corpus comprises sustained vowels [a], [i] and [u],
as well as four sentences spoken by 22 normophonic and
dysphonic speakers. Two of the sentences involve voiced
segments exclusively and the other two involve voiced and
unvoiced segments. The four sentences are matched
grammatically and have the same number of syllables. Seven
judges have determined the degree of perceived overall deviation
from modal voice (i.e. grade) in the framework of a
compared-items paradigm [3].


III. RESULTS AND DISCUSSION


A. Vocal noise 


Vocal noise has been extracted by means of the 
algorithm described in section II.A. Global and segmental 
signal-to-dysperiodicity ratios have been computed and 
correlated with perceived degrees of hoarseness (grade). 
Tab.2 summarizes the Pearson correlation coefficients 
between perceived degree of hoarseness and global and 
segmental signal-to-dysperiodicity ratios. 


Table 2: Pearson's correlation coefficients between
average hoarseness scores and global and segmental
signal-to-dysperiodicity ratios for sustained vowel [a]
and sentences S1-S4 obtained via energy-equalized
(GV) and energy- and average-equalized variograms
(AGV)


          | [a]   | S1    | S2    | S3    | S4
Segmental | -0.70 | -0.85 | -0.79 | -0.80 | -0.66


The results show that, for sustained vowels as well as 
spoken sentences, the global and segmental signal-to- 
dysperiodicity ratios correlate with the perceptual ratings. 
One observes that when energies as well as averages of 
the signal analysis frames are equalized, the correlation 
with perceived degree of hoarseness is increased. The 
increase is more marked for the global signal-to- 
dysperiodicity ratio. An explanation for this observation 
is discussed hereafter. 

In the speech signal one occasionally observes large- 
amplitude, low-frequency “pop” noise caused by breath 
hitting the microphone housing. These parasitic transients 
are low-frequency and ignored or not perceived by 
human listeners. The energy of such low-frequency 
transients may be comparable to the total signal energy, 
however. The impact of such parasitic low-frequency 
pops is greater on the global signal-to-dysperiodicity ratio 
than on the segmental one because the latter dilutes the 
effects of isolated events by averaging over several 
segments. A consequence is that segmental signal-to- 
dysperiodicity ratios correlate better than global ratios 
with perceived hoarseness. 
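
The dilution argument can be checked numerically with a toy sketch. It assumes that the signal-to-dysperiodicity ratio is the energy ratio in dB and that the segmental ratio is the mean of the per-frame ratios; the test signal, the frame length and the size of the "pop" are invented for illustration.

```python
import math

def sdr_db(signal, noise):
    """Signal-to-dysperiodicity ratio in dB over the given samples."""
    return 10.0 * math.log10(sum(s * s for s in signal) / sum(e * e for e in noise))

def segmental_sdr_db(signal, noise, frame_len):
    """Mean of the per-frame ratios in dB."""
    ratios = [sdr_db(signal[i:i + frame_len], noise[i:i + frame_len])
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return sum(ratios) / len(ratios)

# Ten 100-sample frames of a tone with weak dysperiodicity.
signal = [math.sin(2 * math.pi * k / 50) for k in range(1000)]
noise = [0.01 * (-1) ** k for k in range(1000)]
clean_global = sdr_db(signal, noise)
clean_segmental = segmental_sdr_db(signal, noise, 100)

# One isolated large-amplitude event leaks into a single frame of the
# dysperiodicity estimate.
popped = list(noise)
for k in range(300, 400):
    popped[k] += 0.5
pop_global = sdr_db(signal, popped)
pop_segmental = segmental_sdr_db(signal, popped, 100)

global_drop = clean_global - pop_global          # roughly 24 dB in this toy case
segmental_drop = clean_segmental - pop_segmental  # roughly 3 dB in this toy case
```

In this toy case a single corrupted frame lowers the global ratio far more than the segmental one, which is the behaviour described in the paragraph above.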

Average-equalizing the analysis windows removes 
most of the effects of low-frequency pop noise [2, 6]. A 
consequence is an increase of the correlation with 
perceived hoarseness for both segmental and global 
signal-to-dysperiodicity ratios. The increase is more
marked for global than segmental signal-to-dysperiodicity
ratios because the former are more strongly affected by
isolated large-amplitude events.

Fig.2 is a scattergram that shows on the horizontal axis 
perceptual scores of hoarseness for sentence S3 and on 
the vertical axis the global signal-to-dysperiodicity ratios, 
computed by means of the average-equalized and un- 
equalized variograms. Generally speaking, the effect of 
equalizing frame averages in addition to frame energies is 
to improve the linearity between perceptual and acoustic 
cues and to increase the Pearson correlation coefficient, 
which is a measure of linear correspondence. 

One sees that the difference between the two analysis 
methods increases with the signal-to-dysperiodicity ratio. 
This is because when frame averages are not equalized 
the influence of low-frequency pop noise on the global 
signal-to-dysperiodicity ratio is stronger in clean signals. 


Figure 2: Global signal-to-dysperiodicity ratio
(vertical axis) versus perceptual scores (horizontal
axis) and linear regression lines for sentence S3.
Increasing scores to the right on the horizontal axis
correspond to increasing scores of perceived
hoarseness, that is, decreasing signal-to-dysperiodicity
ratios. The black and white dots correspond to global
signal-to-dysperiodicity ratios obtained via average- &
energy-equalized and energy-equalized variograms
respectively.


B. Cycle duration jitter 


Fig.3 illustrates the extraction of cycle lengths via the
analysis of peak saliences (sections II.C, II.D). The upper
trace is the unfiltered speech signal, i.e. a fragment of
vowel [a] sustained by a hoarse female speaker (the
degree of hoarseness is 15 on a scale from 1 to 21). The
voice is perceived as breathy rather than rough. The
second graph shows the peak saliences of the low-pass
filtered signal fragment and the bottom graph shows the
cycle lengths extracted on the basis of the cycle peak
saliences and the inter-peak durations.

Figure 3: Fragment of vowel [a], peak saliences and
cycle lengths.


Abstract: Acoustical properties of speech have been
shown to be related to mental states such as 
depression and remission. In particular, energy in 
frequency bands has been used as features for group 
classification among the groups with mental states of 
remission, depression, and suicidal risk. The 
prediction algorithms presented develop an 
additional level of assessment and are designed to 
predict a score for the severity of the mental state, 
provided by the Beck Depression Inventory. Several 
multiple regression models have been produced 
relating the results of the inventory and the power in 
four frequency bands. Models were produced for 
both males and females using both spontaneous and 
automatic speech. 

Keywords: speech, mental states, power spectra, 
regression 


I. Introduction 


Methods to help to identify persons who are at 
elevated risk of suicide are sorely needed in clinical 
practice. This study represents an attempt to relate the 
frequency content in speech to the mental state of 
persons in two study groups: near-term suicidal and 
depressed. Vocal cues have been used as indicators in 
diagnosing the syndrome underlying a person’s 
abnormal behavior or emotional state by experienced 
clinicians [1], [2], but these skills are not in widespread 
clinical use. Considerable evidence suggests that 
emotional arousal produces changes in the speech 
production scheme by affecting the respiratory, 
phonatory, and articulatory processes that in turn are 
encoded in the acoustic signal [3], [4]. Certain changes 
in speech parameters may be specific to near-term 
suicidal states. Research has shown that depression has a 
major effect on the acoustic characteristics of voice as 
compared to normal controls. Prosody is slower and the 
energy in the speech is distributed differently over the 
frequency range between 0 and 2,000 Hz. 

In published pilot studies [1], [5], [6], analytical 
techniques have been developed to determine if subjects 
were in one of three mental states: healthy control, non- 
suicidal depressed, or high-risk suicidal. In particular 
power spectral density, PSD, features of vocal output 
characteristics have been found to be effective in 
differentiating among those mental states. Subsequent 
studies have shown that these features are effective in 
differentiating among remitted, depressed and suicidal 


speech in spontaneous speech and suicidal speech 
from depressed speech in automatic speech during 
reading. These results suggested that power spectral 
density analysis can be used to produce acoustical 
features for assessing suicide risk [7]. Because of this 
categorization accuracy, we hypothesized that these 
features could also be used to predict the severity of 
the mental state. To quantify the mental state, a
standard psychological assessment tool, the Beck 
Depression Inventory (BDI) was used [8]. This 
provides a numerical quantity from 0 to 64 with 0 
indicating a normal state and 64 indicating a high-risk 
for suicide. 


II. Methodology 


Database 

Recordings were obtained from males and females
in two different patient groups: high-risk suicidal and
depressed. Two types of speech samples were recorded
for each subject in each patient group: spontaneous
speech from interviews with a therapist, and automatic
speech from reading the "Rainbow Passage". The passage
is used in speech science because it contains all of the
normal sounds in spoken English and is phonetically
balanced [9].

The recordings of the 13 female patients and 11
male patients were obtained from an ongoing study. The
ages of the patients were between 25 and 65 years.
All speech signals were sampled at a rate of 10 kHz.
Background noise, long silent periods, and voices
other than the patient's were removed using the
GoldWave v.5.08 audio editor. The edited continuous
speech was then preprocessed in two steps. First, all
edited speech was divided into 20-millisecond frames,
each frame was tested for voicing, and only the voiced
frames were kept and concatenated into 20-second
segments. Second, all speech segments were detrended
and normalized to have a variance of 1 before analysis
to compensate for possible differences in recording
level among
subjects. For each patient the length of the voiced 
interview speech was approximately 8 minutes and 
the reading speech was approximately 2 minutes. 
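
The second preprocessing step can be sketched as below. The sketch assumes that "detrended" means removal of a least-squares straight line; the text does not say which kind of detrending was used, and the function name and test signal are hypothetical.

```python
import math
import statistics

def detrend_and_normalize(segment):
    """Remove the least-squares straight line, then scale to unit variance."""
    n = len(segment)
    t_mean = (n - 1) / 2.0
    x_mean = statistics.fmean(segment)
    slope_num = sum((t - t_mean) * (x - x_mean) for t, x in enumerate(segment))
    slope_den = sum((t - t_mean) ** 2 for t in range(n))
    slope = slope_num / slope_den
    residual = [x - (x_mean + slope * (t - t_mean)) for t, x in enumerate(segment)]
    scale = statistics.pstdev(residual)
    return [r / scale for r in residual]

# A drifting tone standing in for one 20-second voiced segment.
segment = [0.002 * t + math.sin(0.3 * t) for t in range(200)]
normalized = detrend_and_normalize(segment)
```

After this step every segment has zero mean and unit variance, so differences in recording level between subjects no longer affect the absolute power values.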

Each subject also completed the Beck Depression 
Inventory, BDI, [8]. This is a standard, brief, self- 
rated inventory used as a measure for mood.


Feature Extraction

Power spectral densities (PSDs) of the voiced speech
were obtained by using the Welch method with
non-overlapping 100-point Hamming windows [1]. Four
features were calculated in four frequency ranges:
0 Hz to 500 Hz, 500 Hz to 1000 Hz, 1000 Hz to 1500 Hz,
and 1500 Hz to 2000 Hz.
For each of the 500 Hz sub-bands (x1, x2, x3, x4), the
percentages of the total power were calculated and 
stored. For each segment, the average power in each 
band over all frames was calculated and used as the 
feature set for that segment. 
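
A minimal sketch of the band-power features follows, assuming a 10 kHz sampling rate (so that a 100-point window gives 100 Hz bins), half-open band edges, and fractions taken relative to the total power between 0 and 2000 Hz. It averages the periodograms first and then forms the fractions; the order of these two operations, like all names below, is an assumption.

```python
import cmath
import math

FS_HZ = 10_000
WIN = 100
BAND_EDGES_HZ = [0, 500, 1000, 1500, 2000]

def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def periodogram(frame, window):
    """Power at DFT bins 0..n/2 of one windowed frame (naive DFT)."""
    n = len(frame)
    xw = [s * w for s, w in zip(frame, window)]
    return [abs(sum(xw[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def band_power_fractions(segment):
    """Fractions x1..x4 of the 0-2000 Hz power in four 500 Hz sub-bands."""
    window = hamming(WIN)
    frames = [segment[i:i + WIN] for i in range(0, len(segment) - WIN + 1, WIN)]
    spectra = [periodogram(f, window) for f in frames]
    psd = [sum(col) / len(spectra) for col in zip(*spectra)]  # Welch average
    bin_hz = FS_HZ / WIN
    bands = [sum(p for k, p in enumerate(psd) if lo <= k * bin_hz < hi)
             for lo, hi in zip(BAND_EDGES_HZ, BAND_EDGES_HZ[1:])]
    total = sum(bands)
    return [b / total for b in bands]

# A 300 Hz tone should put almost all of its power into the first band.
tone = [math.sin(2 * math.pi * 300 * t / FS_HZ) for t in range(1000)]
fractions = band_power_fractions(tone)
```

Because the four fractions sum to one, any three of them determine the fourth, which is why only the first three sub-bands enter the regression.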


Regression Analysis 

Because the percentages of power were used as
features, only the power in the first three sub-bands can
be independent, and only these were used for analysis. The BDI and
the acoustical features were stored in matrix form for
regression analysis. The BDI is the dependent variable
and the model is shown in equation 1, where bdi(i) is the
BDI score for subject i, the a_k are the weighting coefficients,
and x_1(i), x_2(i), and x_3(i) are the sub-band energies.


$$\mathrm{bdi}(i) = a_0 + a_1 x_1(i) + a_2 x_2(i) + a_3 x_3(i) + a_4 x_1(i)\,x_2(i) + \text{other cross-products} \qquad (1)$$


In order to choose the most appropriate model,
Akaike's information criterion (AIC, eqn. 3) was utilized
[10]. The AIC measures the goodness of fit of the
estimated model, not only rewarding a small Residual
Sum of Squares (RSS, eqn. 2) but also assessing a
penalty for the number of free parameters (k), for any
number of measurement samples (n).

$$\mathrm{AIC} = 2k + n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) \qquad (3)$$


All combinations of models up to second order were
determined for males and females and both speech
types. 
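
As a small worked example of how the criterion trades fit against complexity, the sketch below evaluates AIC = 2k + n ln(RSS/n) for two candidate models. The residual sums of squares, the sample count and the parameter counts are invented for illustration and are not values from this study.

```python
import math

def aic(rss, n, k):
    """Akaike's criterion for a least-squares fit: 2k + n*ln(RSS/n)."""
    return 2 * k + n * math.log(rss / n)

# Hypothetical comparison: a linear model (4 parameters) against a model
# with three additional second-order terms (7 parameters).
simple = aic(rss=1200.0, n=24, k=4)
richer = aic(rss=1100.0, n=24, k=7)
preferred = "simple" if simple < richer else "richer"
```

With these made-up numbers the modest reduction in RSS does not pay for the three extra parameters, so the smaller model has the lower AIC and would be retained.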


III. Results


Table I displays the number of patients in each
group, the range of values of the BDI scores and the
energy band ratios in each of the three groups. The
acronyms are: ‘sc’ for the suicidal patients, ‘dp’ for
the depressed patients, ‘rm’ for the remitted patients,
and ‘PSD’ for the power spectral density values. The
table shows that there is a quasi-definitive range for
each energy band ratio. The 1st energy band ratio
seems to fall in the range of 0.546 to 1, the 2nd energy
band ratio between 0.054 and 0.546, and the 3rd
energy band ratio approximately in the range
of 0.001 to 0.054. As expected, the suicidal patients have
the highest BDI scores, the depressed patients
have mid-range BDI scores, and the remitted patients
have the lowest total BDI scores. However, the
remitted and depressed patients’ BDI scores overlap
somewhat. For the males, the overlap range is 9 to 16
and for females, it is 18 to 21. There is also an
overlap between the male reading depressed and
suicidal scores. As the number of patients in
each group indicates, speech samples for both the reading
and interview sessions could not always be obtained
for every patient. The overall results of the regression
analysis are shown in Table II.


Table I. Range of Data Values

Gender | Session   | BDI Score (sc) | BDI Score (dp) | BDI Score (rm) | PSD1        | PSD2          | PSD3
Male   | Reading   | 20 - 57        | 9 - 34         | [illegible]    | 0.60 - 0.94 | 0.054 - 0.468 | 0.003 - 0.044
Male   | Interview | 40 - 57        | 9 - 30         | 0 - 16         | 0.60 - 0.94 | 0.058 - 0.384 | 0.002 - 0.054
Female | Reading   | 28 - 51        | 18 - 38        | 0 - 21         | 0.65 - 0.96 | 0.038 - 0.335 | 0.001 - 0.044
Female | Interview | 34 - 51        | 18 - 38        | 0 - 21         | 0.61 - 0.97 | 0.031 - 0.361 | 0.001 - 0.051



Table II. Final Model Hypotheses for Four Groups

Gender | Session   | Model Hypothesis (Multiple Linear Equation)
Male   | Reading   | y = 21 - 36x1 + 13x2 + 4250x3 - 80406x3² - 3946x2x3
Male   | Interview | y = -4030 + 4050x1 + 4120x2 + 193370x3 - 187970x1x3 - 192000x2x3 - 237010x3²
Female | Reading   | y = 8926 - 196261x1 - 15468x2 + 2705x3 + 10743x1² + 6710x2² + 17286x1x2 - 13213x2x3
Female | Interview | y = 26240 - 57148x1 - 38456x2 - 35499x3 + 30935x1² + 12338x2² + 43308x1x2 + 43956x1x3


All regression models have a full linear component
and some quadratic and cross-product terms. For the
males, the squared 3rd energy band and the cross-
product of the 2nd and 3rd energy bands are also
important in both the reading and interview groups. The
interview group has an additional cross-product term
involving the 1st and 3rd energy bands. For the females,
the squares and the cross-product of the 1st and 2nd energy
bands are also important in both the reading and
interview groups. The interview group also has an
additional cross-product term between the 1st and 3rd
energy bands. None of the models have the same
coefficients for the linear terms. The R² values for the
four groups are listed in Table III. They range from 0.25
to 0.50. These indicate significant models.


Table III. R² Values

Gender | Session   | R² Value
Male   | Interview | 0.50
Female | Interview | 0.41


IV. Discussion and Conclusion 


The percentages of power in the frequency bands
appear to be significant predictors of mental state.
Spontaneous speech tends to be modeled better. The first
frequency band contains the dominant amount of energy.
Three of the four models have a negative coefficient for
the linear term of the first band. This is
consistent with background material
showing that energy below 500 Hz increases during
depression. The model for male spontaneous speech does
not follow this pattern. The conundrum is that none of the
models are the same.

The average range of BDI scores for each mood 
class was determined from Table I and is shown in Table 
IV. For any future patient, the power spectral densities 
can be extracted from an audio recording of their 
interview or reading of the Rainbow Passage. However,
there are overlap regions and thus an ambiguity
between the mood classes, and more information
about the patient’s history may be needed. The first
three power spectral densities (energy band ratios) 
could be integrated into a more extensive model of 
the patient’s respective group, and the BDI score 
would be estimated. The clinician could thus 
determine the patient’s mood or state of mind by 
comparing the estimated BDI score with Table IV. 
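
The proposed use can be sketched as follows. The polynomial is the male reading-session model as printed in Table II, with the extraction-damaged exponent on x3 read as a square (an assumption); the score ranges are those of Table IV. The band fractions passed in are hypothetical, and the function names are illustrative only.

```python
MOOD_RANGES = {"Suicidal": (30, 64), "Depressed": (14, 35), "Remitted": (0, 18)}

def male_reading_bdi(x1, x2, x3):
    """Estimated BDI from the first three band-power fractions."""
    return (21 - 36 * x1 + 13 * x2 + 4250 * x3
            - 80406 * x3 ** 2 - 3946 * x2 * x3)

def candidate_classes(bdi):
    """All mood classes whose BDI range contains the score."""
    return [name for name, (lo, hi) in MOOD_RANGES.items() if lo <= bdi <= hi]

# Hypothetical band fractions for one recording.
score = male_reading_bdi(0.80, 0.15, 0.02)
classes_for_score = candidate_classes(score)

# Scores inside an overlap region map to two classes.
ambiguous_high = candidate_classes(32)
ambiguous_low = candidate_classes(16)
```

The two-class results for scores in the overlap regions illustrate the ambiguity mentioned above, where additional information about the patient would be needed.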

However, the inconsistencies cited above need to 
be clarified by a larger and more expanded database. 
Persons who are in the normal and in the remitted 
depressive category need to be considered because 
they, in general, have lower BDI scores. Also more 
patients in each category need to be measured to 
improve the statistical reliability. 


Table IV. Final Range for Mood Classes

Mood Class | Total BDI Score Range
Suicidal   | 30 - 64
Depressed  | 14 - 35
Remitted   | 0 - 18


DISTINGUISHING HIGH RISK SUICIDAL SUBJECTS AMONG DEPRESSED
SUBJECTS USING MEL-FREQUENCY CEPSTRUM COEFFICIENTS AND
CROSS VALIDATION TECHNIQUE


Abstract: This paper describes a way to distinguish
high risk suicidal patients among depressed patients 
using mel-frequency cepstrum coefficients. 
Distinguishing high risk suicidal patients among 
depressed patients is an important problem; a 
practical solution to this would prevent the loss of 
many lives. In this study, the vocal characteristics of
female and male patients’ speech samples were
analyzed and a small subset of the first ten mel-frequency
cepstrum coefficients was used to classify
high risk suicidal patients and depressed patients.
Cross validation was used to observe classification 
performance. There were two different types of 
speech samples from both male and female patients. 
One of them was the speech sampled during a clinical 
interview and the other was speech sampled during a 
text-reading session. 


Keywords: Speech, MFCC, suicide, depression, cross 
validation 


I. INTRODUCTION 


It is reported [1] that mental disorders are very
common in the United States and internationally.
Twenty-six percent of Americans aged 18 or older had a
mental disorder in 2005. In the same year, major
depressive disorder affected 6.7 percent of the U.S.
population [1]. More than 90 percent of the people who
committed suicide had a diagnosable mental disorder,
most commonly a depressive disorder [2], so there is an
important relationship between depression and suicide.

As can be seen from these statistics, suicide is an 
important public health problem and has a strong 
relationship with depression. Therefore, it is very 
important to evaluate a depressed patient's risk of 
committing suicide. Psychiatrists evaluate this risk using 
clinical interviews and rating scales, such as the Hamilton 
depression rating scale. [3] Additionally, it is known that 
psychological states affect a person's speech production 
system. It was proposed by S. E. Silverman that vocal 
parameters of human speech could assist in recognizing 
and then assessing suicide risk. [4] 

Some researchers have studied the relationship 
between vocal tract characteristics and suicidal risk. 


Tolkmitt et al. compared the formant information of
vowels that occur in identical phonetic contexts
during the patient's recovery period. [5] France et al.
observed long term averages of the formant information 
and found that they were able to distinguish high risk 
suicidal patients from depressed and control patient 
groups. [6] Yingthawornsuk et al. used the percentages of 
the total power, its highest peak value and its frequency 
location to distinguish between high risk suicidal, 
depressed and remitted (had been depressed previously 
but recovered) groups.[7] In another study, 
Yingthawornsuk et al. used the spectral energy and the 
GMM based feature of the vocal tract system response for 
separating two groups of female patients carrying a 
diagnosis of depression and suicidal risk.[8] Kaymaz 
Keskinpala et al. used both energy in frequency bands, 
and first eight mel-cepstral coefficients to distinguish 
between high risk suicidal and depressed patients. [9] 
Ozdas used lower order mel-cepstral coefficients to 
distinguish high risk suicidal patients from non-suicidal 
ones using Gaussian mixture models and unimodal 
Gaussian models. [10] 

Mel-frequency cepstral coefficients are useful 
parameters that have been used in many speech 
processing systems, such as in [10]. Logan proposed 
using mel-frequency cepstral coefficients for modeling 
music. [11] Godino-Llorente et al. used short term mel- 
cepstral parameters for pathological voice quality 
assessment. [12] Choi worked on compensating the mel- 
frequency cepstral coefficients for speech recognition in 
noisy environments. [13] 

This paper presents work on distinguishing high risk 
suicidal patients from depressed patients using a small 
subset of the first ten mel-frequency cepstral coefficients 
for female and male patients. Cross validation was used 
to estimate the classification performance. The optimal 
mel-frequency cepstrum coefficients are found for
female and male patients and for both the reading and 
interview sessions of each gender. 


II. METHODOLOGY 


A. Database 
A.1. Information about the Database 

The database for this research is obtained from an 
ongoing study within the Department of Psychiatry at
Vanderbilt University School of Medicine and supported
by the American Foundation for Suicide Prevention. The 
study and consent process was developed in collaboration 
with, and approved by the Vanderbilt University 
Institutional Review Board. The database is composed of 
recordings from male and female subjects whose ages are 
between 25 and 65 years of age. Psychiatric clinicians, 
not involved in this study, categorized these patients as 
depressed, and with or without high risk of suicide, and 
referred them to research personnel for consent 
procedures, diagnostic confirmation and a brief 
recording. The number of the female patients and male 
patients that are used in this study is shown in Table 1. 


Table 1. Female and Male Patient Database

Female / Male      | Interview | Reading
Depressed          | 18 / 11   | 16 / 14
High-Risk Suicidal | 11 / 9    | 9 / 9


The database contains two different types of speech 
samples. One sample type was recorded while the patient 
was interviewed by a physician or highly-trained research 
assistant. This type of speech sample was named the 
"Interview Session". The other one is named the 
"Reading Session" and was recorded while the patient 
read a predetermined part of a book. Quiet, closed rooms in
clinical settings provided the recording environments. 


A.2. Preprocessing 

All speech signals were digitized using a 16-bit
analog-to-digital converter at a sampling rate of 10 kHz
with an anti-aliasing filter. The GoldWave v.5.08 audio
editor was used to remove silences longer
than 0.5 seconds and voices that did not belong to the
patient. In this study, 76 seconds of each female patient's
continuous speech from both interview and reading
sessions were stored for analysis. For male patients, 66
seconds of continuous speech were stored. All stored
speech signals underwent detection of voiced and
unvoiced speech segments. Only voiced segments were
used for subsequent analysis.


B. Feature Extraction 


The features used for the analysis were a small subset
of the first ten mel-frequency cepstrum coefficients in
each patient's speech sample.

Each speech signal was divided into voiced segments of
512 points. For each voiced speech segment the
log-magnitude spectrum was computed from the discrete Fourier
transform (DFT). The spectrum was then filtered by a
series of 16 triangular band-pass filters. The filter bank
used in this work is similar to that employed
by Davis and Mermelstein [14], which simulates
critical band filtering by a set of triangular band-pass
filters. The bandwidths and center frequencies of these
filters are chosen according to the mel scale.

The human ear is more sensitive to changes in the low
frequency portion of the frequency spectrum. [15] Thus,
the mel scale was formulated for the sampling of the
frequency spectrum based on this property of human
auditory perception. The linear frequency spectrum was
mapped based on human auditory perception, with the
mapping approximately linear on the 0 - 1 kHz range and
logarithmic above 1 kHz. The following formula
models this relationship, where
F_mel is the perceived frequency and F_Hz is the actual
frequency.


$$F_{mel} = 2595 \log_{10}\!\left(1 + \frac{F_{Hz}}{700}\right) \qquad (1)$$

Vocal tract length normalization was performed for
each patient. The bandwidths and center frequencies of
the filters in the mel-scale filter bank were then adjusted
according to this normalization factor. [16] The last step
is to calculate the inverse discrete Fourier transform
(IDFT) to obtain the mel-frequency cepstrum
coefficients. The procedure is shown in Fig. 1 below.


Fig. 1. Feature extraction procedure.


After the first ten mel-frequency cepstrum 
coefficients were calculated for each frame, the values in 
all frames are averaged to have one value for each mel- 
frequency cepstrum coefficient for each patient. 
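
The mel mapping and the placement of the filter centres can be sketched as below. The sketch uses the formula F_mel = 2595 log10(1 + F_Hz/700) given above; the upper band edge of 5000 Hz (the Nyquist frequency at a 10 kHz sampling rate) and the equal spacing of the 16 centres on the mel axis are assumptions, and no vocal tract length normalization is applied.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def filter_centers_hz(n_filters=16, f_low=0.0, f_high=5000.0):
    """Centre frequencies equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

centers = filter_centers_hz()
# Spacing between neighbouring centres, in Hz, grows with frequency.
spacings = [b - a for a, b in zip(centers, centers[1:])]
```

The widening spacing reflects the property described above: resolution is finest at low frequencies, where hearing is most sensitive to change.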


C. Cross-Validation Classification


The k-fold cross validation technique [17] with a
quadratic discriminant function was performed on the
mel-frequency cepstrum coefficient data. The data files
were split randomly into two subsets, one for
training the classifier and the other for testing it.
Sixty-five percent of the data was used for training, that
is, for estimating the quadratic classification function. Then,
using this quadratic classification function, the remaining
35% of the data was tested by performing the classification. The
variance of the performance estimates was reduced by
averaging the results from 10 different runs of cross
validation. 
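
The evaluation loop described above can be sketched for a single coefficient. The sketch follows the split stated in the text (a random 65%/35% partition, repeated 10 times and averaged) and uses a one-dimensional Gaussian model per class, whose log-likelihood is quadratic in the feature. The data are synthetic and all names are hypothetical; this is not the authors' implementation.

```python
import math
import random
import statistics

def quadratic_score(x, mean, var, prior):
    """Log of prior times Gaussian density, up to a constant; quadratic in x."""
    return -0.5 * math.log(var) - (x - mean) ** 2 / (2.0 * var) + math.log(prior)

def holdout_accuracy(samples, train_fraction=0.65, runs=10, seed=0):
    """samples: list of (feature_value, label). Returns mean test accuracy."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train, test = shuffled[:cut], shuffled[cut:]
        models = {}
        for label in {lab for _, lab in train}:
            values = [x for x, lab in train if lab == label]
            var = max(statistics.pvariance(values), 1e-9)  # guard against zero variance
            models[label] = (statistics.fmean(values), var, len(values) / len(train))
        hits = sum(
            max(models, key=lambda lab: quadratic_score(x, *models[lab])) == truth
            for x, truth in test)
        accuracies.append(hits / len(test))
    return statistics.fmean(accuracies)

# Synthetic, well-separated values standing in for one averaged coefficient.
samples = ([(0.1 * i, "depressed") for i in range(20)]
           + [(10.0 + 0.1 * i, "high_risk") for i in range(20)])
accuracy = holdout_accuracy(samples)
```

Averaging over several random partitions reduces the dependence of the estimate on any single split, which matters with as few subjects per class as in Table 1.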

A simple approach is used to seek sub-optimal 
combinations of one, two, and three coefficients for 
classifications. The cross validation procedure is 
performed for each mel-frequency cepstrum coefficient
separately. The cepstral coefficient that gives the 
maximum classification result is determined first. Next, 
this cepstral coefficient is paired with all the other 
cepstral coefficients and cross validation classification is 
performed again. The resulting pair of cepstral 
coefficients that gave the maximum classification result is 
determined. The same process is repeated for three 
cepstral coefficients that gave the maximum 
classification. Three classification performances (one 
coefficient performance, two coefficients performance, 
and three coefficients performance) are then compared 
and then the set giving the best performance is assigned 
as the optimal coefficients. 
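
The search just described is a greedy forward selection, which can be sketched generically. The scoring function below is a made-up stand-in for the cross-validated classification performance, and the function names are hypothetical.

```python
def forward_select(features, score, max_size=3):
    """Grow the subset one feature at a time, always adding the feature
    that maximizes the score, then return the best subset size seen."""
    chosen, history = [], []
    for _ in range(max_size):
        remaining = [f for f in features if f not in chosen]
        best = max(remaining, key=lambda f: score(chosen + [f]))
        chosen = chosen + [best]
        history.append((chosen, score(chosen)))
    return max(history, key=lambda item: item[1])

def toy_score(subset):
    """Illustrative score: coefficients 1 and 4 help, the rest hurt."""
    useful = {1: 0.70, 4: 0.08, 7: -0.05}
    return sum(useful.get(f, -0.10) for f in subset)

best_subset, best_value = forward_select(list(range(1, 11)), toy_score)
```

Such a search evaluates far fewer subsets than an exhaustive one, at the cost of possibly missing the best combination, which is why the text calls the result sub-optimal.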

This performance testing is performed for three
criteria: maximizing only the depressed
classification, maximizing only the high risk
suicidal classification, and maximizing the total
classification of depressed and high risk suicidal
patients.


III. RESULTS


The depressed vs. high risk suicidal pairwise
classification using the k-fold cross validation technique was
performed to find the optimal coefficient(s) that gave
the maximum classification performance. The results for
the male interview and reading sessions are shown in
Table 3 and Table 4, respectively.


Table 3. Male Interview Session's Classification Results

                        | Optimal Coefficient(s)  | Classification Performance
Only Depressed          | Coefficients 1 and 4    | 78.60%
Only High Risk Suicidal | Coefficient 3           | 86.00%
Total Classification    | Coefficient 3           | 77.20%


Table 4. Male Reading Session's Classification Results

                        | Optimal Coefficient(s)  | Classification Performance
Only Depressed          | Coefficients 2, 9 and 1 | 89.80%
Only High Risk Suicidal | Coefficient 2           | 93.00%
Total Classification    | Coefficient 2           | 78.00%


The optimal features for depressed classification are
coefficients 1 and 4 for the interview session, with a
classification performance of 78.60%, and
coefficients 2, 9 and 1 for the reading
session, with a classification performance of 89.80%.

Coefficient 3 is the optimal feature for both high risk
suicidal classification and total classification of depressed
vs. high risk suicidal patients in the interview data, with
classification performances of 86% and 77.20% respectively.

The optimal feature for both high risk suicidal
classification and total classification of depressed vs. high
risk suicidal patients in the reading session is
coefficient 2. The classification performance was 93% for
the high risk suicidal classification and 78% for the total
classification.


Table 5. Female Interview Session's Classification Results

                        | Optimal Coefficient(s)  | Classification Performance
Only Depressed          | Coefficients 1, 5 and 7 | 78.90%
Only High Risk Suicidal | Coefficient 9           | 70.10%
Total Classification    | Coefficient 9           | 66.40%


Table 5 shows the results for the female interview
session. The optimal features for depressed classification
are coefficients 1, 5 and 7, with a classification
performance of 78.90%. The optimal
feature is coefficient 9 for both high risk suicidal
classification and total classification of depressed vs. high
risk suicidal patients, with classification
performances of 70.10% and 66.40% respectively.


Table 6. Female Reading Session's Classification Results

                        | Optimal Coefficient(s)  | Classification Performance
Only Depressed          | Coefficients 3 and 2    | 70.10%
Only High Risk Suicidal | Coefficient 8           | 71.10%
Total Classification    | Coefficient 9           | 63.90%


Table 6 presents the female reading session
classification results and optimal features. The optimal
features for depressed classification are coefficients 3 and
2, with a classification performance of 70.10%. For high
risk suicidal classification, the optimal feature is
coefficient 8, with a classification performance of 71.10%.
Coefficient 9 is the optimal coefficient for the total
classification of depressed vs. high risk suicidal
patients, with a classification performance of
63.90%.


IV. DISCUSSION AND CONCLUSIONS 


This paper demonstrates that mel-frequency cepstrum
coefficients are a good indicator for discriminating
between depressed patients at high and low risk of
suicidal behavior. Male and female patients were
analyzed separately. The mel-frequency cepstrum
coefficients discriminated among the depressed patients,
matching the vocal assessment to the clinical assessment
with a performance better than 70%.

The controlled text-reading tended to give better
results for male subjects, especially for high risk suicidal
classification and depressed classification. The total
classification was about the same for both reading
and interview sessions.

The maximum classification results obtained
from the male subjects are noticeably better than the
female subjects' results.

These findings may be limited by several factors, 
including the imperfections of the recording 
environments, the reliance on clinical assessments (by 
non-research as well as research diagnosticians) for a 
reference standard, and the variable timing of recordings 
relative to peak intensities of suicidal risk. 

Nevertheless, the findings may ultimately be
applicable to the development of clinically practical 
instruments for detecting vocal stress that could indicate a 
need for increased attention to suicidal risk assessment. 
These findings, along with other findings in the literature, 
indicate that feedback and feed-forward regulatory 
pathways for speech production are impaired in 
depression. Identifiable and quantifiable alterations in 
these pathways may provide needed paradigms for the 
study of the pathophysiology of depression.


RECORDING SPEECH
DURING MAGNETIC RESONANCE IMAGING


T. Lukkari, J. Malinen, P. Palo
Institute of Mathematics, Helsinki University of Technology, Helsinki, Finland


Abstract: We discuss recording arrangements
for speech during an MRI scan of the speaker's
vocal tract. The image and sound data thus
obtained will be used for the construction and
validation of a numerical model of the vocal tract.
Keywords: Speech recording, MRI 


I. INTRODUCTION 


This article reports progress in development of a 
FEM-based numerical simulator for Finnish vowels. 
To obtain the anatomical geometry and to validate 
the model, we need formants from a speech signal 
that is recorded simultaneously with an MRI scan. 

Magnetic resonance imaging (MRI) has been used 
for imaging the vocal tract for a long time [1]. Nowa- 
days, the scanning can be carried out in well under 
30 s [3]. The anatomical data produced by MRI is 
suitable for generating the computational mesh for 
the finite element method (FEM). FEM solvers for 
the wave equation have been used for simulating nor- 
mal speech production acoustics [4; 5; 9], the effects of 
anatomical abnormalities, and oral and maxillofacial 
surgery on speech [2; 6; 8]. 

We shall carry out the imaging using a Siemens 
Magnetom Avanto 1.5 T machine. The environment 
in an MRI room is challenging from the viewpoint of 
sound recording. There is a static 1.5 T magnetic field 
within the MRI coil, and even the ambient field may 
be considerable. An imaging sequence produces an
electromagnetic field at ≈ 64 MHz since the Larmor
frequency of protons is 42.58 MHz/T. The peak power
may reach several kilowatts. To further complicate 
things, there is acoustic noise of about 90 dB (SPL) 
over a range of frequencies that inconveniently overlap 
the expected formants. 
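
As a quick check of the RF frequency quoted above, the value follows
directly from the proton Larmor constant and the static field strength.
The short Python sketch below only reproduces that arithmetic; the
function name is illustrative and is not part of the original work.

```python
# Proton Larmor constant (gyromagnetic ratio divided by 2*pi), as quoted in the text.
LARMOR_MHZ_PER_TESLA = 42.58


def larmor_frequency_mhz(field_tesla: float) -> float:
    """Return the proton resonance frequency in MHz for a static field in tesla."""
    return LARMOR_MHZ_PER_TESLA * field_tesla


if __name__ == "__main__":
    # 1.5 T is the field strength of the scanner named in the text.
    print(f"RF frequency at 1.5 T: {larmor_frequency_mhz(1.5):.2f} MHz")
```

The result lies in the 10 to 100 MHz band that the shielding described
in Section II is designed to block.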

The noise prevents the subject from hearing
her/his own voice during the scan. Thus, the de-noised,
undelayed signal should be fed back into the subject's
earphones to make the speech more natural. As
the experiments involve a human subject, safety and
comfort must be taken into account.

Roughly speaking, the task is to separate a plane
wave (i.e., the speech) from a cylindrically symmetric
noise source (i.e., the environment), while paying
attention to the complications described above.


II. SPECIFICATIONS AND DESIGN 


Because of the magnetic field, only negligible 
amounts of ferromagnetic material may be used in the 
experimental apparatus inside the MRI room. None 
at all is allowed in the sound collector within the MRI 
coil. All electronics inside the MRI room have to be 
shielded against overvoltage and radio frequencies. Of 
course, closed loops in all conducting material must 
be strictly avoided. 


A. Sound collector and acoustic wave guides 


A two-channel sound collector will be used, one 
channel for the speech and the other for the noise. 
The dimensions of the collector must be small com- 
pared to the formant wavelengths, and the collector 
must fit inside the MRI equipment. 
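
To give a feeling for the size requirement, the following Python sketch
computes acoustic wavelengths over the measured band (0.42 to 3.3 kHz)
and compares them with a collector dimension. The speed of sound
(343 m/s), the factor-of-ten rule of thumb, and the 10 mm example
dimension are assumptions made for illustration only; the text does not
state the actual collector size.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees Celsius


def wavelength_m(freq_hz: float) -> float:
    """Acoustic wavelength in metres at the given frequency."""
    return SPEED_OF_SOUND_M_S / freq_hz


def is_acoustically_small(dimension_m: float, freq_hz: float, ratio: float = 10.0) -> bool:
    """True if the wavelength is at least `ratio` times the dimension.

    The default ratio of 10 is a hypothetical rule of thumb, not a value
    taken from the paper.
    """
    return wavelength_m(freq_hz) >= ratio * dimension_m


if __name__ == "__main__":
    for f in (420.0, 1000.0, 3300.0):
        print(f"{f:6.0f} Hz -> wavelength {100.0 * wavelength_m(f):5.1f} cm")
    # Hypothetical 10 mm collector dimension, checked at the top of the band.
    print(is_acoustically_small(0.010, 3300.0))
```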


Figure 1: Acoustic wave guides and 
their suspension arrangement 


The sound signals are transmitted to a microphone 
assembly by acoustic wave guides (see Fig. 1). They 
are constructed from soft PVC tube of inner diame- 
ter 9 mm. The length of each wave guide is 3.0 m, 
and they are suspended pairwise so as to cancel out 
external disturbances. 

The medium in the collector and the wave guides
is air. Sound transmission in the wave guide walls
appears to be negligible. The frequency response of
the acoustic wave guide between 0.42 and 3.3 kHz is given
in Fig. 2. At lower frequencies in Fig. 2, longitudinal
resonances of the wave guide appear. Below 1.5 kHz,
there is ≈ 4 dB attenuation per octave that can be
easily compensated by, e.g., an RC filter.


Figure 2: Frequency response of the wave guide


B. Shielded microphone assembly and cabling 


The microphone assembly is enclosed in a Fara- 
day cage. The cage is made of 6 mm aluminium 
plate, which is thick enough not to buckle or resonate. 
Damping material can be used inside the cage, if
necessary. The acoustic wave guides are brought into the
cage through electromagnetic wave guides, designed
to be opaque at frequencies between 10 and 100 MHz.

The microphone assembly (see Fig. 3) consists of
four Panasonic WM-62 condenser microphones (with
sensitivity −45 ± 4 dB re 1 V/Pa at 1 kHz, ⌀ 9 mm)
and a power source for them. The nominal frequency
response of the microphones, as given by the
manufacturer's data sheet, is essentially flat in the
frequency range of interest. In a cursory measurement, the
sensitivities and frequency responses of these microphone
units did not differ from each other significantly,
and hence we omitted more detailed calibration
measurements.

The microphones are embedded into a plate that
is acoustically and electrically isolated from the
walls of the Faraday cage. The sound waves enter
the microphones through simple, adjustable acoustic
impedance matchings (see the lower right corner in
Fig. 3). These matchings are tuned experimentally
by closing some of the holes (⌀ 2 mm) in the walls
of the tubes. Tuning is carried out in order to minimize
the wave reflected from the microphone assembly,
analogously to the termination of ordinary electric
transmission lines. This results in a partial suppression
of the longitudinal resonances of the wave guide (see
Fig. 2).


Figure 3: Microphone assembly


An energy dissipation of several dB is seen in
the frequency response of the system, depending on 
the number of impedance matching holes that have 
been closed. Since the matching consists of both open 
and closed partial terminations of the wave guide, the 
residual reflection takes place both with and without 
phase inversion. We remark that this corresponds ex- 
actly to the number of measured peaks in Fig. 2. 
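
One way to read the remark about reflections with and without phase
inversion is through the textbook resonances of an ideal tube. The
Python sketch below lists both families for a 3.0 m guide: equal
terminations at both ends (multiples of c/2L) and unlike terminations
(odd multiples of c/4L). It assumes a lossless one-dimensional tube and
a speed of sound of 343 m/s; it is not the authors' analysis, and the
real guide with its impedance matchings will deviate from these values.

```python
def half_wave_resonances(length_m, f_max_hz, c=343.0):
    """Resonances n*c/(2L) of a tube with like terminations at both ends."""
    spacing = c / (2.0 * length_m)
    n_max = int(f_max_hz // spacing)
    return [n * spacing for n in range(1, n_max + 1)]


def quarter_wave_resonances(length_m, f_max_hz, c=343.0):
    """Resonances (2n-1)*c/(4L) of a tube with one open and one closed end."""
    base = c / (4.0 * length_m)
    freqs = []
    n = 1
    while (2 * n - 1) * base <= f_max_hz:
        freqs.append((2 * n - 1) * base)
        n += 1
    return freqs


if __name__ == "__main__":
    guide_length = 3.0  # metres, as stated for each wave guide
    hw = half_wave_resonances(guide_length, 3300.0)
    qw = quarter_wave_resonances(guide_length, 3300.0)
    print(f"half-wave spacing: {hw[1] - hw[0]:.1f} Hz")
    print(f"lowest quarter-wave resonance: {qw[0]:.1f} Hz")
```

Under these assumptions the resonances of each family are spaced by
about 57 Hz, so many of them fall inside the measured band.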

The signals are transmitted from the MRI room by
two microphone cables (Tasker C116 4x0.14-26AWG),
with two channels in each. All cable endings are shielded
against overvoltages by diodes. Since only two channels
are used by the sound collector, the remaining
microphones are kept in reserve.


C. De-noising amplifier and CMRR curves 


The test subject needs to hear the de-noised signal
in real time. Hence, we implement the de-noising
system as an analog device. It is a summing amplifier
(see Fig. 4) with one direct channel (for the signal)
and three adjustable, inverted channels (for subtracting
up to three noise signals). Before recording, the
summing coefficients are adjusted manually by listening
to the output. The device is constructed using six
LM741s, and its input impedance is 3 kΩ.
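
The operation of the summing amplifier can be illustrated with a small
discrete-time model: the output is the direct channel minus a weighted
sum of the noise channels. The Python sketch below is only a numerical
analogy of the analog device. In particular, the least-squares fit of
the coefficient stands in for the manual adjustment by ear described
above, and the test signals (200 Hz "speech", 1 kHz "noise", leakage
factor 0.8) are made up for the example.

```python
import math


def fit_coefficient(direct, noise):
    """Least-squares coefficient c that minimises sum((direct - c*noise)**2)."""
    num = sum(d * n for d, n in zip(direct, noise))
    den = sum(n * n for n in noise)
    return num / den


def summing_amplifier(direct, noise_channels, coefficients, direct_gain=1.0):
    """Direct channel minus the weighted noise channels, sample by sample."""
    out = []
    for i, d in enumerate(direct):
        y = direct_gain * d
        for channel, c in zip(noise_channels, coefficients):
            y -= c * channel[i]
        out.append(y)
    return out


# Synthetic demonstration with hypothetical signals.
fs = 8000
t = [n / fs for n in range(800)]
speech = [math.sin(2.0 * math.pi * 200.0 * x) for x in t]
noise = [math.sin(2.0 * math.pi * 1000.0 * x) for x in t]
leak = 0.8  # assumed fraction of the noise that reaches the speech channel

# "Adjustment" phase: only noise is present on the direct channel.
coeff = fit_coefficient([leak * n for n in noise], noise)

# "Recording" phase: speech plus leaked noise on the direct channel.
direct = [s + leak * n for s, n in zip(speech, noise)]
cleaned = summing_amplifier(direct, [noise], [coeff])
residual = max(abs(c - s) for c, s in zip(cleaned, speech))
print(f"fitted coefficient: {coeff:.3f}, worst-case residual: {residual:.2e}")
```

In this idealised setting the noise is removed completely; in the real
system the cancellation is limited by the phase and amplitude mismatch
between the channels.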

The frequency response of the amplifier is flat between
0.2 and 5 kHz. Its optimal common mode rejection
ratio (CMRR) between 0.42 and 3.3 kHz is given by the
lowest, quite smooth curve in Fig. 5. This CMRR can
be improved by reducing the tolerances of the electrolytic
capacitors in the amplifier.

The uppermost, rather rough-looking curve in Fig. 5
is the measured CMRR of the whole system. This
includes the wave guides and the acoustic impedance
matchings at the ends of the wave guides. The difference
between the two curves in Fig. 5 is mostly due
to the physical properties of the wave guides and,
unfortunately, the poor quality of the sound source
used in the measurements.


Figure 4: De-noising amplifier


D. Computer equipment and signal processing 


The de-noised signal from the amplifier is digitized
using a MacBook Pro (model MacBookPro2,2) running
Mac OS X 10.4.9. The required signal processing and formant
extraction will be done using Matlab 7.4, the Signal
Processing Toolbox, and custom-made code. In particular,
the longitudinal resonances visible in Fig. 2 can be
compensated with Matlab. We remark that the frequency
response must be remeasured in the final experimental
setting since bending the wave guide will
move the resonance frequencies [7].
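
The text does not say which formant extraction method will be used.
Linear prediction (LPC) is a common choice, and the Python sketch below
shows the idea under that assumption: the autocorrelation method with
the Levinson-Durbin recursion, followed by peak picking on the LPC
spectral envelope. The test signal is synthetic (two resonances at
700 Hz and 1200 Hz with 80 Hz bandwidth, sampled at 8 kHz); all of these
values and function names are hypothetical and chosen only to make the
example self-contained.

```python
import cmath
import math


def resonator_poly(freq_hz, bandwidth_hz, fs):
    """Denominator coefficients of a two-pole resonator."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    theta = 2.0 * math.pi * freq_hz / fs
    return [1.0, -2.0 * r * math.cos(theta), r * r]


def poly_mul(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def impulse_response(a, n_samples):
    """Impulse response of the all-pole filter 1/A(z)."""
    y = []
    for n in range(n_samples):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, min(len(a), n + 1)):
            acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def autocorrelation(x, max_lag):
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns [1, a1, ..., a_order]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def envelope_peaks(a, fs, step_hz=5.0):
    """Frequencies of the local maxima of 1/|A(e^{jw})| on a regular grid."""
    n_steps = int((fs / 2.0) / step_hz)
    freqs = [i * step_hz for i in range(1, n_steps)]
    mags = []
    for f in freqs:
        w = 2.0 * math.pi * f / fs
        denom = sum(a[k] * cmath.exp(-1j * w * k) for k in range(len(a)))
        mags.append(1.0 / abs(denom))
    return [freqs[i] for i in range(1, len(freqs) - 1)
            if mags[i] > mags[i - 1] and mags[i] > mags[i + 1]]


fs = 8000.0
true_a = poly_mul(resonator_poly(700.0, 80.0, fs), resonator_poly(1200.0, 80.0, fs))
signal = impulse_response(true_a, 2048)
lpc = levinson_durbin(autocorrelation(signal, 4), 4)
formants = envelope_peaks(lpc, fs)
print("estimated formants (Hz):", formants)
```

For recorded speech one would normally window short frames, apply
pre-emphasis, and use a higher prediction order; those steps are omitted
here to keep the sketch short.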


III. MEASUREMENTS 


We proceed to explain in detail how the data in 
Figs. 2 and 5 was obtained. 


A. Arrangement and equipment 


A sine wave generator (Taylor 192A) was coupled
to a two-channel sound source (see Fig. 6), and the
sound pressure at the source was manually kept at a
constant level of 94 dB (SPL) for frequencies between
0.42 and 3.3 kHz. This was accomplished by measuring
the output of the reference microphones inside the sound
source using an analog voltmeter (Heathkit V-7 A) through
a microphone preamplifier (Resound CVS908). An
oscilloscope was used to detect possible distortion
visually.
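
For orientation, the 94 dB (SPL) level can be converted to a sound
pressure and to the voltage expected from a reference microphone. The
Python sketch below uses the standard 20 µPa reference pressure and the
nominal sensitivity quoted in Section II (−45 ± 4 dB re 1 V/Pa). It
treats all quantities as RMS values and ignores the preamplifier, so the
numbers are indicative only.

```python
P_REF_PA = 20e-6  # standard reference pressure for dB SPL in air


def spl_to_pressure_pa(spl_db):
    """RMS sound pressure in pascal for a level given in dB SPL."""
    return P_REF_PA * 10.0 ** (spl_db / 20.0)


def mic_output_volts(spl_db, sensitivity_db_re_1v_per_pa):
    """RMS open-circuit voltage of a microphone with the given sensitivity."""
    return spl_to_pressure_pa(spl_db) * 10.0 ** (sensitivity_db_re_1v_per_pa / 20.0)


if __name__ == "__main__":
    level = 94.0
    print(f"pressure: {spl_to_pressure_pa(level):.3f} Pa")
    # Nominal sensitivity and the two ends of the quoted tolerance band.
    for sens in (-49.0, -45.0, -41.0):
        mv = 1000.0 * mic_output_volts(level, sens)
        print(f"sensitivity {sens:5.1f} dB -> {mv:.2f} mV")
```

A level of 94 dB (SPL) corresponds to roughly 1 Pa, which is why
microphone sensitivities are conventionally quoted at this level.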

The produced sound signals were fed to the mi- 
crophone assembly (see Fig. 3) through the wave 
guides (see Fig. 1). The wave guides were completely 
straightened out during the measurements, and the 
surrounding acoustical noise was controlled by vari- 
ous means. 

From the microphone assembly, the two signals
were brought to the direct and inverted channels of
the de-noising amplifier. The amplification of the direct
channel was set to 45 dB. The amplification of the
inverted channel was set so that the output of the
amplifier was at its minimum when a 1 kHz sine signal
was fed to both the direct and inverted channels.


Figure 5: Optimal CMRR curves of the amplifier
(lowest) and the acoustic wave guides (uppermost)

All data for Figs. 2 and 5 were measured using a
second analog voltmeter (Goerz Unigor 226221) at
the output of the de-noising amplifier. At all measured
frequencies, readings were taken both with and
without the inverted channel coupled.
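
The paper does not spell out how the two readings are turned into the
curves of Fig. 5. A natural reading, assumed here, is that the CMRR at
each frequency is the ratio of the output voltage without the inverted
channel to the output voltage with it, expressed in decibels. The Python
sketch below implements that assumed definition; the voltmeter readings
in the example are invented for illustration.

```python
import math


def level_db(voltage, reference_voltage):
    """Voltage ratio expressed in decibels."""
    return 20.0 * math.log10(voltage / reference_voltage)


def cmrr_db(v_direct_only, v_with_inverted):
    """Assumed CMRR definition: suppression gained by coupling the inverted channel."""
    return level_db(v_direct_only, v_with_inverted)


if __name__ == "__main__":
    # Hypothetical readings: (frequency in Hz, volts without, volts with inverted channel).
    readings = [(500.0, 1.00, 0.010), (1000.0, 1.00, 0.005), (3000.0, 0.80, 0.080)]
    for freq, v_without, v_with in readings:
        print(f"{freq:6.0f} Hz: CMRR = {cmrr_db(v_without, v_with):.1f} dB")
```

The same `level_db` helper, applied to the readings without the inverted
channel and a fixed reference, gives a frequency response curve of the
kind shown in Fig. 2.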


B. Sound Source 


Consider the measurements described above. An 
ideal sound source for such measurements should be 
able to produce two sine wave sound signals of the 
same amplitude and without phase difference. Both 
the channels should be acoustically uncoupled, and 
their acoustic impedances should be the same. All 
this should be accomplished without distortion, over 
a wide range of frequencies and sound pressure am- 
plitudes. 


Figure 6: Disassembled sound source 


Our design (see Fig. 6) consists of a loudspeaker
(⌀ 50 mm, impedance 8 Ω), together with a symmetric
cavity that divides the pressure field into two channels.
There is a reference microphone of type Panasonic
WM-62 embedded in the walls of each channel.

The sound source also includes a rudimentary
acoustic impedance matching of the same type as used 
in the microphone assembly. Its purpose is to mimic 
qualitatively the impedance of the real sound collec- 
tor that will be used inside the MRI equipment. We 
remark that the acoustic impedances of the sound col- 
lector and the sound source are different, which has a 
quantitative effect on a frequency response curve like 
Fig. 2. 

The sound source suffers from the resonances of
both the cavity and the loudspeaker itself. Near such
a resonance, the produced sound signals are out of
phase, and the results of the CMRR measurement
are worse than the true CMRR would be. To reduce
a particularly inconvenient resonance at ≈ 1.7 kHz, a
horn made of copper plate, on the right in Fig. 6, had
to be placed between the loudspeaker and the cavity.
We could not obtain the CMRR data for high frequencies,
since the cavity becomes resonant at ≈ 3.5 kHz.
On the other hand, frequencies under 0.4 kHz must
be produced without the horn in place, since the horn
distorts the signal at lower frequencies.
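
The sensitivity of the measurement to phase errors can be quantified
with a simple model. If the two source channels carry sinusoids with
amplitude ratio g and phase difference φ, the residual after ideal
subtraction has relative magnitude |1 − g·e^{jφ}|, which limits the
measurable CMRR. The Python sketch below evaluates this bound; the model
and the example values are assumptions for illustration and are not
taken from the measurements.

```python
import cmath
import math


def cmrr_limit_db(amplitude_ratio, phase_deg):
    """Upper bound on the measurable CMRR for a given source imbalance."""
    residual = abs(1.0 - amplitude_ratio * cmath.exp(1j * math.radians(phase_deg)))
    if residual == 0.0:
        return math.inf  # perfectly matched channels cancel completely
    return -20.0 * math.log10(residual)


if __name__ == "__main__":
    for phase in (0.5, 1.0, 5.0, 10.0):
        print(f"phase error {phase:4.1f} deg -> CMRR limit {cmrr_limit_db(1.0, phase):.1f} dB")
    print(f"10 % amplitude error -> CMRR limit {cmrr_limit_db(0.9, 0.0):.1f} dB")
```

Under this model, a phase difference of only 1 degree already limits the
apparent CMRR to about 35 dB, which is consistent with the remark that
the measured values near source resonances underestimate the true CMRR.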

The peaks at 1.7 kHz, 2.85 kHz, and 3.3 kHz in
the uppermost CMRR curve in Fig. 5 are at least partly
explained by a phase difference between the sound source
channels. These phase differences were verified by
an oscilloscope Lissajous measurement. However, the
peak at 1.95 kHz is not due to a phase difference.

Above 2 kHz, the channels of the sound source begin
to drift out of balance because the loudspeaker is
not symmetric. When this lack of balance was compensated
by readjusting the de-noising amplifier, we
obtained a much better CMRR curve for 2.6-3.3 kHz,
which is also plotted in Fig. 5.

We conclude that the true CMRR for the wave 
guides is significantly better for high frequencies than 
what Fig. 5 would indicate. The design and construc- 
tion of a good quality, multi-channel sound source re- 
mains a challenging exercise in acoustic engineering. 


IV. CONCLUSIONS 


We have described noise cancellation, sound trans- 
mission, and recording techniques through acoustic 
wave guides in difficult environments such as the MRI 
room. 

The acoustic wave guides change sound quality; 
speech becomes somewhat crisp or even hoarse. How- 
ever, speech remains easily understandable without 
numerical compensation of the wave guide resonances. 
In conclusion, we expect to obtain good quality
recordings of many types of speech signals from which,
e.g., successful formant extraction should be possible.


Abstract: The Modified A-Space method is described.
It allows the detailed characterization of the XRMB-SPD
speakers in terms of
mid-sagittal-plane area, antero-posterior distance, 
occlusal plane area, posterior pharynx wall tilt, 
mandible arch width, and oral cavity volume. 
Keywords : Speech Production, Articulatory oral 
space measures 


I. INTRODUCTION 


The X-ray microbeam method for measurement of
articulatory dynamics has been used to acquire large
amounts of data, with reduced X-ray dosage, resulting in
one of the most widely used freely available speech
production databases. The X-ray Microbeam Speech
Production Database (XRMB-SPD), developed at
the University of Wisconsin, USA, includes a vast amount of
coordinate data describing articulatory movements, and
acoustic and electroglottographic data collected
simultaneously [3]. Honda et al. [2] examined the
geometry of the vocal tract of American English and
Japanese speakers from the XRMB-SPD, using a
quadrilateral (A-Space) limited by the palatal plane, the
anterior nasal spine-menton line, the outline of the
posterior pharyngeal wall, and a line parallel to the
palatal plane, passing through the menton and extending
to the pharyngeal wall. In that study, the A-Space of
different speakers varied in shape. The vowel
articulations adapted to the form of the A-Space, whilst
consonant articulations were independent of it.

The Modified A-Space method was used to select 4 
speakers in a study that relates occlusal classes with 
vowel, fricative and stop production adaptations [1]. It 
allows the detailed characterization of the XRMB-SPD 
speakers not just in terms of mid-sagittal-plane area, but 
also in terms of antero-posterior distance, occlusal plane 
area, posterior pharynx wall tilt, mandible arch width, and 
oral cavity volume. This last measure has proven to be far
more reliable and has revealed more speaker dependent
characteristics than the measure previously proposed in
[2].


II. METHODS


XRMB-SPD provides occlusion classification, dental
measures, anthropometric measures, reference pellet
coordinates, biteplate records and palatal outlines for
each of the 57 speakers. This information was used to
measure the articulatory oral space (AOS) in the absence of
cephalometric analysis, based on the Modified A-Space
described in Fig. 1.


[Figure 1 image not reproduced. Legend: axis; reference electrodes;
mobile electrodes; static articulatory coordinates; static mandible
coordinates. Panel labels: mid-sagittal plane; maxillary occlusal
plane (MaxOP).]

Fig. 1: Top — Mid-sagittal-plane coordinates included 
in the XRMB-SPD (MAXn and MAXg — bridge of the 
nose; MAXi — buccal surface of the maxillary incisors; 

MANm- juncture between the first and second 
mandibular molars; MANI — buccal surface of the central 
incisors; LL — lower lip; UL — upper lip; PAL — palate; 
PHA — middle pharynx wall; CON — condyle; COR — 
coronoid process; GON — gonion; GNA — gnathion; MNI 
— lingual surface of the maxillary incisors). Bottom — a 
three dimensional representation of the maxillary arch 
and mid-sagittal palate height of the anterior oral cavity 
(from the distal-buccal cusp tip of the second molar to the 
lips). The Modified A-Space measures M1, M2, M3, M4
and M5 are also represented.


We extracted the following measures of the AOS, as
shown in Figs. 1 and 2. M1: antero-posterior distance,
calculated from the upper incisors to the posterior
pharynx wall; M2: mid-sagittal plane area, from the
mandible to the palate midline; M3: occlusal plane area,
from the distal-buccal cusp tip of the second molar to the
lips; M4: posterior pharynx wall tilt, i.e., the angle
between the pharynx and the occlusal planes; M5:
mandible arch angle, calculated with several mandible
points; M6: anterior oral cavity volume. Areas of
trapeziums (A1, A2, A3, A5 and A6) and a triangle (A4),
and volumes of convex hulls of cubes and tetrahedrons
were used to estimate the AOS, as shown in Fig. 2.
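
The area and volume building blocks mentioned above can be computed
directly from point coordinates. The Python sketch below shows two
generic helpers: the shoelace formula for the area of a planar polygon
(which covers trapeziums and triangles) and the scalar triple product
for the volume of a tetrahedron. The coordinates in the example are
made-up shapes, not XRMB-SPD data, and the decomposition of the oral
cavity into such elements is left to the procedure described in the
text.

```python
def polygon_area(vertices):
    """Area of a simple planar polygon from its ordered (x, y) vertices."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def tetrahedron_volume(a, b, c, d):
    """Volume of the tetrahedron with vertices a, b, c, d given as (x, y, z)."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    triple = (u[0] * (v[1] * w[2] - v[2] * w[1])
              - u[1] * (v[0] * w[2] - v[2] * w[0])
              + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(triple) / 6.0


if __name__ == "__main__":
    # Hypothetical trapezium with parallel sides of length 4 and 2 and height 2.
    trapezium = [(0.0, 0.0), (4.0, 0.0), (3.0, 2.0), (1.0, 2.0)]
    print("trapezium area:", polygon_area(trapezium))
    # Hypothetical unit corner tetrahedron.
    print("tetrahedron volume:",
          tetrahedron_volume((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)))
```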


Fig. 2: Measures M2, M3, M4, M5 and oral cavity
volume for speaker JW15, showing the mid-sagittal and
occlusal planes (top) and half of the oral cavity volume as
reconstructed using the Modified A-Space (bottom).


III. RESULTS AND DISCUSSION


Results showed a considerably larger average oral
cavity volume and a greater antero-posterior distance of
the AOS in male subjects than in females, as shown in Fig. 3.

The detailed characterization of the XRMB-SPD
speakers, shown in Figs. 4 to 7, revealed great variability.


Fig. 3: M1, M2, M3 and M6 measures of the available 18
Class I male and 22 Class I female speakers.


IV. CONCLUSION 


The Modified A-Space provided additional 
information, allowing the characterization of cranio-facial 
features and the selection of a uniform set of speakers in 
studies [1] involving XRMB-SPD. This method combines 
anatomical data and biomedical signals producing a 
reference dataset for research into speech production. We 
believe that this method may provide additional
information to regular cephalometric analysis.


Fig. 4: M1 and M2 measures of Class I speakers.
Numbers on the x-axis represent the actual XRMB-SPD
speaker identification.

Fig. 5: M1 and M2 measures of Class II speakers.
Numbers on the x-axis represent the actual XRMB-SPD
speaker identification.

Fig. 6: M3 and M6 measures of Class I speakers.
Numbers on the x-axis represent the actual XRMB-SPD
speaker identification.

Fig. 7: M3 and M6 measures of Class II speakers.
Numbers on the x-axis represent the actual XRMB-SPD
speaker identification.


Abstract: Mucosal waves have been considered of
crucial importance for healthy vocal fold vibration, 
but their appearance and variability have been 
described only vaguely so far. We studied the 
appearance of the mucosal waves using 
videokymography. The mucosal wave was divided into
two components: 1) the vertical phase differences
between the lower and upper margins of the vocal 
folds, which are reflected in sharpness of the lateral 
peaks in kymograms and 2) the waves propagating 
laterally over the vocal fold surface which can be 
recognized as lateral movements on the vocal fold 
surface occurring during medial movement of the 
glottal edge. Different features of the laterally 
traveling mucosal waves were recognized. The 
suggested new classification of the mucosal wave 
properties opens new possibilities for more sensitive 
monitoring of the state of the vocal fold tissues in 
basic voice research and clinical practice. 


Keywords : Mucosal waves, videokymography, high- 
speed videolaryngoscopy, vocal fold vibration 


I. INTRODUCTION 


Mucosal waves on the vocal folds have been considered 
of crucial importance for healthy vocal fold vibration. 
Their presence or absence reflects the pliability of the
vocal fold mucosa. Mucosal waves are one of the basic
laryngeal features evaluated routinely by clinicians using
strobolaryngoscopy and used for diagnosis of voice
disorders [1]. The waves are known to travel upwards
along the medial vocal fold surface and then continue
laterally over the upper surface of the vocal folds. Their
appearance and variability have been described only
vaguely so far, however. The purpose of the present study
was to determine the basic features of the mucosal waves 
in order to allow their better specification and more 
sensitive evaluation using videokymography. 


II. METHODS 


More than 7,000 videokymographic (VKG) examinations of
patients with various types of voice disorders were
performed and recorded at the Center for Communication
Disorders, Medical Healthcom, Ltd, in Prague from 1996
to 2006. The details of the equipment have been described
elsewhere [2,3]. The VKG examinations were always preceded by
strobovideolaryngoscopy. The subjects' VKG
examinations were usually around 1 to 5 minutes in
duration and contained approximately 3,000 to 15,000
VKG images, i.e., consecutive video fields of 18.4-ms
duration. Of the 7,000 patient examinations, only about
20% were processed due to time constraints.

The processing involved field-by-field viewing of the 
videotape recordings and a search for images with good 
focus, illumination, and contrast that showed clear 
vibration patterns at the locations of interest on the vocal 
folds. These images were digitized with the video board 
Miro PCTV. From these, 1 or more VKG images were 
selected by the examiner as the most representative for 
the subject and were then combined (with Corel Photo 
Paint software) with corresponding laryngoscopic and 
laryngostroboscopic images into a final set of images for 
the patient record [2-4]. 

For the purposes of the present study, 100 VKG images 
of sustained phonations from 45 subjects were 
retrospectively selected from these patient records. The 
images were selected so that they covered the widest 
possible spectrum of vocal fold behavior. The images 
were compared among themselves, and the differences in 
the appearance of the mucosal waves were analyzed 
visually. 


III. RESULTS 


Two components of the mucosal waves were 
distinguished: 1) the vertical phase differences between 
the lower and upper margins of the vocal folds, and 2) the 
continuing waves propagating laterally over the vocal 
fold surface. The vertical phase differences were found
encoded in the videokymographic images in the
sharpness of the lateral peaks of the vocal-fold waveform
contour. The laterally traveling mucosal waves were 
defined as lateral movements on the vocal fold surface 
occurring during medial movement of the glottal edge. 
Based on this new definition, we found four basic 
features which distinguished various types of laterally 
traveling mucosal waves: a) lateral extent, b) 
enhancement/light reflection, c) spatial separation from 
the vocal fold margin and d) delay in appearance after 
vocal fold peak displacement. These four features are 
considered to reflect different mucosal and geometrical 
properties of the vocal folds. 


IV. DISCUSSION 


The two components of the mucosal waves reflect, in 
principle, two different events. Whereas the mucosal 
movements on the medial surface and the corresponding 
vertical phase differences are actively driven by the 
glottal airflow, the laterally traveling waves on the upper 
vocal fold surface are passive continuations of the 
vertically traveling mucosal waves. The sharpness of the 
lateral peaks theoretically reflects pliability of the medial 
vocal fold surface. The sharper the lateral peaks, the 
larger the vertical phase differences, and the more pliable 
the medial vocal fold surface [5,6]. 

The new definition was found useful for recognizing 
the laterally traveling mucosal waves and distinguishing 
them from other events and artifacts on the vocal folds. 
The laterally traveling mucosal waves reflect the
pliability of the upper vocal fold surface. Theoretically,
the larger the lateral extent of the wave, the more pliable 
the mucosa of the upper surface is [1]. Mucosal wave 
enhancement by a specular light reflection indicates 
horizontality of the upper vocal fold surface; separation 
of the mucosal wave from the margin suggests an 
enlarged amount of incompressible material in the 
mucosa (such as edema fluid); and the appearance delay
suggests that the upper surface of the vocal folds is
extensively bulged (e.g., from excessive activity of the
external part of the thyroarytenoid (TA) muscle or due to
structural abnormalities). The suggested new
classification of the mucosal wave properties opens new
possibilities for more sensitive monitoring of the state of
the vocal fold tissues in basic voice research and clinical
practice.


Abstract: A 3D biomechanical finite element model of
the face is presented. Muscles are represented by
piece-wise uniaxial tension cable elements linking the
insertion points. Such insertion points are specific
entities differing from nodes of the finite element mesh,
which makes it possible to change either the mesh or the
muscle implementation totally independently of each
other. Lip/teeth and upper lip/lower lip contacts are also 
modeled. Simulations of smiling and of an Orbicularis 
Oris activation are presented and interpreted. The 
importance of a proper account of contacts and of an 
accurate anatomical description is shown. 

Keywords : Face models, Muscle modeling, Lip/teeth 
interaction. 


I. INTRODUCTION 


Many biomechanical models of the human face have
been proposed in the literature. They were generally
developed in the context of computer graphics
animation [1,2], computer-aided maxillofacial
surgery [3,4] or speech production studies [5]. Most of
them model the face with a volumetric mesh defined by
an external surface (the “visible” part of the face) and
an internal surface (the part in contact with the skull),
with some nodes or layers in between. The mechanics of the
tissues (epidermis, dermis, hypodermis, fat and muscles)
is then modelled through the relation between
displacements and forces of mass points, or through
strain/stress relations in the case of finite element models.
These studies have raised a number of important issues:
(1) how to model muscle fibres and their action on the
3D mesh; (2) how to account for the subject/patient
specific muscular morphology (in terms of fibre
insertions and interweaving); (3) how to control the large
number of muscles in order to produce a given speech
articulation or facial mimics. The latter two points have
already been addressed and discussed by our group,
respectively for computer-aided craniofacial surgery [4]
and through a motor control model of tongue muscle
activation for speech production [6].


This paper deals with the first issue and proposes a
method to define, from Computed Tomography (CT)
images, a subset of muscle fibres within a 3D mesh of the
face. Contacts between lips and teeth are also handled.
First results of simulations of facial mimics are presented.


II. METHODS 


The starting point of this modeling work is the 3D
Finite Element model of the face soft tissues, built from the
CT scan of a single patient, which was originally
proposed in [4]. It relies on a volumetric mesh consisting
of hexahedral and wedge elements (Fig. 1, left). The
displacements of several nodes located on the internal
surface of the facial mesh are constrained in order to
represent attachments of the facial tissues on the skull.

While biological soft tissues are known to behave
non-linearly [7], they were first represented by a
homogeneous, isotropic, linear material. This hypothesis
was retained in a first stage in order to focus on muscle
modelling and contact management, before moving on to
more realistic modelling. Simulations were computed
using the ANSYS™ v11 finite element software.

The first part of our study consisted of building the
muscles involved in the generation of facial mimics. In order
to ensure anatomical and physical reliability, muscle
courses and insertions were directly defined from medical
images and anatomical charts (Fig. 1, middle), with the
help of a maxillofacial surgeon. The locations of points 
describing the muscle fibres were measured in the 
different scan slices. These points were then integrated 
into the mesh to model muscle insertions. They were 
linked with piece-wise uniaxial tension cable elements to 
model muscle fibres (Fig. 1, right). The Orbicularis Oris 
muscle was designed slightly differently: it is represented 
by two ellipsoid cable elements centred on the mouth 
opening, and representing the marginal and the peripheral 
parts of the muscle. 

The cable-element-based approach allows muscles to be
integrated into the model independently of the mesh itself.
Therefore, the mesh can be easily refined or modified,
without requiring any change in the muscle structure
definition. The fibre cable elements are controlled in
tension by their cross-sectional area, their initial strain and
an activation parameter. They generate forces that are
applied to the soft tissue mesh thanks to the notion of
dependencies. In other words, muscle fibre extremities
are linked with the facets of the surrounding mesh
elements. When a muscle is activated, the corresponding
cable elements exert forces on the mesh elements and
thereby induce soft tissue deformations.
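
The behaviour of a tension-only cable element can be sketched in a few
lines of Python. The text names the controlling quantities (cross-
sectional area, initial strain, activation parameter) but not the exact
law, so the formula below is a hypothetical illustration: the activation
scales the initial strain, the stiffness is linear with an assumed
Young's modulus, and the element carries no force when slack. It is not
the ANSYS element formulation used by the authors.

```python
import math


def cable_tension(length, rest_length, area, youngs_modulus,
                  initial_strain=0.0, activation=1.0):
    """Tension (N) in a tension-only cable; zero when the cable is slack.

    Hypothetical law: total strain = geometric strain + activation * initial strain.
    """
    geometric_strain = (length - rest_length) / rest_length
    total_strain = geometric_strain + activation * initial_strain
    return max(0.0, youngs_modulus * area * total_strain)


def cable_end_forces(p_a, p_b, rest_length, area, youngs_modulus,
                     initial_strain=0.0, activation=1.0):
    """Equal and opposite forces pulling the two insertion points together."""
    d = [b - a for a, b in zip(p_a, p_b)]
    length = math.sqrt(sum(c * c for c in d))
    tension = cable_tension(length, rest_length, area, youngs_modulus,
                            initial_strain, activation)
    unit = [c / length for c in d]
    force_on_a = [tension * u for u in unit]    # pulls A towards B
    force_on_b = [-tension * u for u in unit]   # pulls B towards A
    return force_on_a, force_on_b


if __name__ == "__main__":
    # Made-up values: 50 mm fibre, 1 mm^2 cross-section, 1 MPa modulus, 10 % initial strain.
    f_a, f_b = cable_end_forces((0.0, 0.0, 0.0), (0.05, 0.0, 0.0),
                                rest_length=0.05, area=1e-6, youngs_modulus=1e6,
                                initial_strain=0.1, activation=1.0)
    print("force on A:", f_a)
    print("force on B:", f_b)
```

Because the forces are computed from the insertion point positions
alone, such elements can be attached to any mesh, which is the property
of the approach emphasised above.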

The second part of our study concerns lip/teeth and
upper-lip/lower-lip contacts, which are of primary
importance in lip movements and deformations. Teeth
are represented in the model by surfaces extracted from
the CT data and interpolated with spline functions.
ANSYS contact elements, which provide collision
detection and sliding reaction, are used to mesh the lip and
teeth surfaces.


Figure 1 Left: the finite element mesh of the face soft tissues. Middle: interactive segmentation of muscle fibers on
CT data. Right: location of eleven muscles involved in facial mimics on the left side of the skull.


III. RESULTS


a. Simulating smiling lips. 

Figure 2 presents the mesh deformations (in 
colours/grey scale) and the final face shape when the 
Zygomaticus Major, the Risorius and the Levator Labii 
Superioris are activated simultaneously. The bottom 
panels show the results when lip/lip and lip/teeth 
contacts are taken into account, while the top panels 
show the results obtained without contact. In the presence
of contacts, the facial expression appears more realistic.
This is particularly clear in the side views (right
panels), where, in the absence of contacts, an
interpenetration of the lips can be noticed. In contrast,
the lips are slightly open when contacts are handled,
which is in agreement with data on smiling.


b. Orbicularis activation and lips protrusion. 

An important issue for speech production concerns 
the control of protruded lips, such as in the production 
of /u/ or /y/ in French. It has been suggested in the 
literature [8] that the interaction between an activation 
of the Orbicularis Oris and lip/lip and lip/teeth contacts 
could be responsible for this particular lip gesture. This 
hypothesis was tested with our face model. 


Two types of simulations were then run with the
activation of the Orbicularis Oris, without (Figure 3, top
panels) and with (Figure 3, bottom panels) contact
handling. In the absence of contacts, a rounding of the
lips is observed (top panel, left). Rounding is classically
associated with protrusion. However, the side view (top
panel, right) shows a strong retraction of the lips, which
is the opposite of a protrusion. Including the contacts
limits the retraction, but it does not generate any
protrusion or rounding.

The absence of protrusion can be explained by the fact
that our model does not separately control the marginal
and the peripheral parts of the Orbicularis Oris. Indeed, Honda
et al. [8] found EMG activation during
protrusion only in the peripheral part of the muscle.
Gomi et al. [5] confirmed this finding with their
biomechanical lip model. However, the absence of
rounding is in agreement with the statement of Gomi et al. [5]
(p. 130), who suggested that not only the
Orbicularis Oris, but also "additional muscles
combinations (jaw opening, peripheralis or other
perioral muscles) would be required to form rounded
lips."


IV. DISCUSSION AND CONCLUSION


The face model presented in this paper integrates an 
original representation of muscle fibres and muscle 
force generation in a 3D mesh based on piece-wise 
uniaxial tension cable elements. It also models contacts 
between upper lip and lower lip and between lips and 
teeth in a realistic way. 

Simulations of smiling lips show that the proposed 
muscle representation is adapted to the generation of lip 
deformations that are realistic both in amplitudes and in 
directions. This is an important result since this 
representation can be implemented independently of the 
mesh. It will thus facilitate the generation of
speaker-specific meshes using a mesh-matching algorithm [9]. It
will also increase the efficiency of such a modelling
approach to study the impact of face surgery on smiling
and on facial expressions in general. Indeed, the geometrical
structure of the mesh can easily be modified to account
for different kinds of surgeries, without requiring a
careful, difficult and long redefinition of all muscle
fibres in the mesh.

Our results also show the importance of contact
modelling, both for lips/teeth interactions and for
upper lip/lower lip interaction. It is important not only
because it prevents unrealistic interpenetrations, but
also because it allows sliding movements that constrain
and guide the movement. This is certainly a
phenomenon underlying complex lip shaping such as
protrusion and rounding.

On the other hand, the simulations of the 
consequences of the activation of the Orbicularis Oris, 
for which no distinction is made in our model between 
the marginal and the peripheral parts, do not generate 
rounding. These results contradict the
simulations carried out by Gomi et al. [5], who did
make this distinction. This shows that collecting
accurate neurophysiological and anatomical data is a 
major challenge to test hypotheses about the control of 
speech gestures, once realistic biomechanical models 
are available. 


Figure 2 Simulation of smiling lips without (top panels) and with (bottom panels) handling 
of lips-teeth and upper-lower lips contacts. Displacements are in mm. 

Figure 3 Simulation of the Orbicularis Oris contraction, without (top panels) and with (bottom panels)
handling of lips-teeth and upper-lip/lower-lip contacts.


Abstract: The velocity fields along a self-vibrating
physical model of vocal folds were
studied experimentally. The shape of the vocal
folds was specified according to data measured
on excised human larynges in phonation position.
The model was fabricated in 4:1 scale as
a silicone body vibrating in the wall of a plexiglass
wind tunnel. The model is not excited
externally and oscillates only due to coupling
with the flow. In addition to acoustic, subglottal
pressure and impact intensity measurements,
flow velocity fields were recorded in the coronal
plane using particle image velocimetry (PIV), in the
domain immediately above the glottis. Analysis of
the PIV images taken within 25 phases of one
vibration cycle gives good insight into the dynamics
of the supraglottal flow.


Keywords: glottal flow, physical model, PIV 


I. INTRODUCTION 


Despite the number of sophisticated mathematical
models of vocal fold vibration and glottal flow devel- 
oped in recent years, experimental approaches still play 
an important role in vocal fold research. The computa- 
tional models can supply very useful data; nevertheless, 
it is necessary to keep in mind that many models are 
based on important simplifications and that the results 
cannot be extrapolated beyond the parameter limits
for which they were designed. The models often cannot
avoid including several ad hoc assumptions. Moreover,
in vocal fold modeling one needs to enter many geomet- 
rical and tissue parameters, whose numerical values are 
often not well known. Therefore, the results from the 
mathematical models should always be verified using 
experimental data. 


The most relevant data regarding vocal fold vibration
originate from measurements on living human subjects.
However, since the human vocal folds are hardly
accessible, the majority of processes occurring during
phonation cannot be measured directly in vivo. The
second possibility is to perform in vitro investigations,
i.e. measurements on excised human or animal larynges.
This approach provides improved accessibility to
measured structures and tissues in better controlled laboratory
conditions; yet many drawbacks of experiments
on living tissues persist: technical complications, poor
measurement reproducibility and also ethical concerns.
This is why several physical vocal fold models with well-defined
and easily controllable parameters have been developed
in recent years, such as the self-oscillating latex-tube
model of Pelorson, Deverge et al. [5, 1], the static
models of Shinwari, Scherer and Fulcher et al. [6, 3],
Kob’s or Erath’s driven scaled models [4, 2] or the self-oscillating
1:1 vocal fold model of Thomson et al. [8].


Investigation of the supraglottal flow velocity field
is one of the cases where both in vivo and
in vitro measurements are hardly feasible. Therefore
a self-vibrating mechanical model of human vocal
folds was designed and fabricated at ENSTA Paris.
The principal goal was to study the conditions under which
flow-induced vibrations of vocal folds occur and to investigate
the velocity fields in the supraglottal channel
immediately downstream of the narrowest glottal gap by
means of Particle Image Velocimetry (PIV). The measured
data were intended to be compared with the results
from a FEM computational model.


II. METHODS 


The physical model was designed as a vocal-fold-shaped
element vibrating in the wall of a rectangular channel.
A 4:1 scaled vocal fold model, oscillating only
due to coupling with airflow, was built (see Fig. 1).
In the current setup, the upper vocal fold is fixed to avoid
difficulties with asymmetric vocal fold vibration, while the
bottom one is supported by four flat springs. Every effort
was made to keep the important dimensionless
characteristics of the model close to the real situation.
The shape of the vocal folds was specified according to
measurements on excised human larynges, performed at
the Institute of Thermomechanics [7].
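
The remark on dimensionless characteristics can be illustrated with a small calculation. The sketch below assumes, which the text does not state explicitly, that the Reynolds and Strouhal numbers are the quantities being matched and that both the model and the real larynx operate in air. The function name is hypothetical; the input values are the jet velocity and fundamental frequency quoted in the figure captions of this paper, which may stem from different measurements.

```python
def life_size_equivalent(scale, model_velocity, model_frequency):
    """Map model quantities to life-size values under Re and St similarity.

    Assumptions (not taken from the paper): same fluid in both cases, and
    Re = U*L/nu and St = f*L/U are equal for the model and the real larynx.
    `scale` is the geometric enlargement of the model (4 for a 4:1 model).
    """
    # Equal Re with equal viscosity: U_real * L_real = U_model * L_model.
    real_velocity = model_velocity * scale
    # Equal St: f_real = f_model * (U_real / U_model) * (L_model / L_real).
    real_frequency = model_frequency * scale * scale
    return real_velocity, real_frequency


u_real, f_real = life_size_equivalent(scale=4, model_velocity=17.0,
                                      model_frequency=13.2)
print(f"life-size velocity ~ {u_real:.0f} m/s, frequency ~ {f_real:.0f} Hz")
```

Under these assumptions a model frequency of 13.2 Hz corresponds to roughly 211 Hz at life size, since the frequency scales with the square of the geometric factor.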


The vocal fold model was mounted into a plexiglass 
wind tunnel. In addition to the PIV system installed 
to measure the supraglottal flow field, the model was 
also equipped with accelerometers, pressure transduc- 
ers and microphones to measure and record vocal fold 
vibration. 

Figure 1: Design of the physical model of vocal folds (in
configuration fixed upper - vibrating lower vocal fold). 
The vibrating elastic silicone rubber element is attached 
to an aluminum profile, supported by four adjustable 
brass flat springs. 


To measure the mean flow in the channel, an ultrasonic
flowmeter was mounted near the downstream end
of the circular channel. Two accelerometers, fixed under
the vibrating vocal fold, were used to record mechanical
vibration. The 4:1 scale of the model allowed the use of
the relatively large, but very sensitive type B&K 4507C
without affecting the system significantly.


III. RESULTS 


The primary purpose of the vibroacoustic measure- 
ments was to acquire supplementary data to the PIV 
records. Basically, the procedure consisted of setting 
the flow rate, taking one ten-second record of the ac- 
celerometer, pressure and acoustic signals, and perform- 
ing a series of PIV measurements for approximately 25 
phases of the vocal fold motion. This procedure was 
repeated for the flow rate values ranging from mini- 
mum flow able to sustain vocal fold vibration up to a 
maximum value, where either the vibrations ceased or 
became chaotic or irregular. 


Fig. 3 shows the measured waveforms and their spec- 
tra for a sample flow rate value, where regular vibra- 
tions with impacts occur. 


An extensive series of PIV measurements was per- 
formed on the vibrating vocal fold model. The flow rate 
was gradually increased from Q = 5.33 l/s (measurement
No. 001) to Q = 25.61 l/s (measurement No. 044).
Within each of the 44 measurements, approximately 25 
PIV records, corresponding to 25 distinct phases of the 
vocal fold oscillation cycle, were taken. This was re- 
alized using the synchronization signal (accelerometer 
signal converted to TTL) and the time-delay function 
of the laser control software. Each PIV record consisted 
of ten PIV measurements of the same phase within ten
successive vibration cycles.
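
The phase-locked acquisition described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that the 25 phases are spaced uniformly over one period, that the trigger marks the start of a cycle, and that the ten fields of one record are averaged; the paper states none of these explicitly. Velocity fields are represented as flat lists of samples.

```python
def phase_delays(frequency_hz, n_phases=25):
    """Trigger delays (s) sampling n_phases uniformly spaced phases of a cycle."""
    period = 1.0 / frequency_hz
    return [k * period / n_phases for k in range(n_phases)]


def phase_average(fields):
    """Element-wise mean of several velocity fields recorded at the same phase."""
    n = len(fields)
    return [sum(values) / n for values in zip(*fields)]


# 13.2 Hz is the fundamental frequency quoted for measurement No. 012.
delays = phase_delays(13.2)

# Ten synthetic two-sample "fields" standing in for ten PIV shots of one phase.
ten_fields = [[1.0 + 0.1 * i, 2.0 - 0.1 * i] for i in range(10)]
mean_field = phase_average(ten_fields)
```

Averaging over the ten cycles suppresses the cycle-to-cycle turbulent fluctuations and keeps the structures that repeat with the oscillation.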


Fig. 2 shows the results of one sample measurement
(out of 44 in total). This measurement was
chosen as a representative case of medium flow rate and
large-amplitude regular oscillations, which subjectively
correspond best to normal voice production.


It can be stated that the flow is not perfectly periodic
in general. The turbulent structures, developing
mainly due to the presence of the boundary layer of the
jet, interact mutually and with the jet in a disordered, 
stochastic way; this is why the flow fields of the same 
phase in successive oscillation cycles are not necessarily 
identical. The important flow structures, however, are 
generated periodically in accordance with the frequency 
of vibration: within each oscillation cycle, a new jet is 
created with one pair of large vortices propagating along 
the jet front. The jet attaches to the channel wall and 
during the closing phase it fades away and eventually 
disappears, leaving the turbulence to damp out. 


The mathematical model, which was designed to cal- 
culate the 2D velocity and pressure fields in the prox- 
imity of the vibrating vocal folds, is based on the 
2D incompressible Navier-Stokes equations in arbitrary 
Lagrangian-Eulerian formulation. The equations were 
discretized by the finite element method. The numeri- 
cal scheme was completely programmed in the Fortran 
language, making use only of open-source libraries for 
the finite element discretization and for the numerical 
solution of the resulting linear system. The results of 
the numerical simulations show the development of the 
supraglottal jet and evolution of the recirculation vor- 
tices within one vocal fold oscillation cycle. 
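
For reference, one standard way of writing the 2D incompressible Navier-Stokes equations in ALE form is given below. The notation is an assumption, since the paper does not print the equations: $\mathbf{u}$ is the fluid velocity, $\mathbf{w}$ the velocity of the moving mesh (domain velocity), $p$ the kinematic pressure, $\nu$ the kinematic viscosity, and $D^{\mathcal{A}}/Dt$ the time derivative taken at a fixed point of the reference (mesh) configuration.

```latex
\begin{aligned}
\frac{D^{\mathcal{A}} \mathbf{u}}{Dt}
  + \bigl((\mathbf{u} - \mathbf{w}) \cdot \nabla\bigr)\mathbf{u}
  - \nu \, \Delta \mathbf{u} + \nabla p &= \mathbf{0}, \\
\nabla \cdot \mathbf{u} &= 0 .
\end{aligned}
```

On a fixed mesh $\mathbf{w} = \mathbf{0}$ and the usual Eulerian form is recovered; the convective term is corrected by the mesh velocity only where the domain deforms with the vocal fold motion.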


Figure 2: Instantaneous velocity field downstream of the
glottis. The vocal folds are on the left. The flow direction
is from left to right. The velocity modulus
is in color. A free jet with a maximum flow velocity 
of U = 17 m/s forms between the vocal folds. Two 
large-scale vortices develop at the sides of the jet front. 

Figure 3: Waveforms and frequency spectra of the acceleration and supraglottal pressure. Measurement No. 012:
medium flow rate Q = 8.58 l/s, ideal for regular vocal fold vibration with an impact in each cycle. Fundamental
frequency 13.2 Hz. On the acceleration waveform, the impact is clearly visible as a peak on the positive
half-wave.

Figure 4: Sample velocity field during the vocal fold vibration cycle, velocity magnitude [m/s].

Figure 5: Sample pressure field during the vocal fold vibration cycle, dynamic pressure [Pa].

Figs. 4 and 5 show sample results calculated
in a numerical simulation using typical values of input
parameters. The channel geometry is the same as
for the physical model. The mesh was triangular and
consisted of 16537 Taylor-Hood (P2/P1) elements. The
upper vocal fold was fixed, and the motion of the bottom
one was prescribed.


IV. DISCUSSION 


Neither the mathematical nor the physical model was 
primarily intended for direct comparison with real hu- 
man vocal folds. The strategy was first to validate the 
mathematical model using results of the PIV measurements
on the physical model; once a satisfactory correspondence
between the computational and physical
models is achieved, the geometry and boundary
conditions of the mathematical model can be modified 
in order to reflect the conditions occurring in real vocal 
folds. For the validation of the model, it was advan- 
tageous to use the configuration with one vocal fold 
moving and the other fixed. 


The results from the mathematical and physical
models obtained so far seem to correspond when compared
visually. It should be noted that some
aspects make a systematic comparison difficult
for the time being; the main limitation is the fact that
the vocal folds are not allowed to collide in the
mathematical model. The processes
accompanying glottal closure are complex and, from the
algorithmic point of view, the separation of the computational
domain into two, the need to introduce additional
boundary conditions and to handle the pressure discontinuity
when reconnecting the domains represent a
very complicated problem. Yet it will be necessary to
deal with this task in future if the mathematical model
is to be employed to model regular loud phonation.


Abstract: Pathology detection by the analysis of voice
has remained a challenging objective of research in recent
years. Former studies have shown that the glottal
signature defined on the glottal source spectral density
contains helpful hints for determining the healthy or
pathological condition of the subject. Therefore a
good reconstruction of the glottal source after
removing the vocal tract is essential for the study.
Nevertheless it may be shown that gender also
strongly influences this glottal signature. Therefore
comparisons and clustering between healthy and
pathological glottal signatures should take into
account the patient’s gender. In the present
paper a new scheme for pathology detection is
presented, based on determining the subject’s
gender a priori. Results from a database of healthy and
pathological subjects are given to support this
approach. A study case is presented as an example
illustrating the interest of the approach.


Keywords: Glottal Signature, Gender Detection, 
Pathology Detection 


I. INTRODUCTION 


The spectral signature of the Glottal Flow Derivative is
very much conditioned by the biomechanics of the glottal
folds. Physiological and functional pathologies introduce
important changes in glottal fold mechanical behaviour,
resulting in perceptible changes in the spectral profile of
the Glottal Signals, which shows specific peaks and
valleys (“V-troughs”) related to resonances and anti-resonances
of the vocal fold biomechanics (see Fig.1), as
these are due to relations among equivalent masses and
springs in classical k-mass models of the vocal folds. In
former studies [1][2] it was established that the statistical
distributions of the spectral profiles of the glottal flow
derivative (glottal source) and mucosal wave correlate
(the residual after the average acoustic wave is removed)
are conditioned by gender. Therefore any study
conducted to detect pathology using parameter
distributions from the spectral profile of glottal signals
has to take gender effects into account. The present study
is intended to propose a methodology to detect pathology
and assess its treatment using gender-specific glottal
source spectral parameters. The validity of the
methodology will be checked on a study case.


II. GLOTTAL SIGNATURE 


Since a relation between the
spectral signature of voice and pathology is well established [3], recent
studies have established time and frequency domain
parameterizations to carry out such comparisons.
Especially interesting are the relations among the
amplitudes of the first harmonics and formants in the
spectral contents of voice (H1-A1, H1-A3, A1-A3,
etc.) as well as on the glottal source spectral envelope [4].
Extending these definitions, the present work presents a
parameterization of the glottal signature which may be
seen as a generalization of formant-harmonic relations.

Figure 1. Power spectral signature of the mucosal wave
correlate from a typical male speaker based on the estimation of
the two first “V-troughs” $\{T_{M1}, f_{M1}\}$, $\{T_{m1}, f_{m1}\}$ and $\{T_{M2}, f_{M2}\}$,
from which 8 singularity parameters are derived {p18, p19, p21,
p22, p27, p28, p30, p31}. The decay rate and the notch slenderness
of both troughs have also been added as parameters to the
analysis set {p32, p33 and p34} as explained in the text. Relative
amplitude is given in dB.


The parameterization proposed is based on the estimation
of the singularities in the power spectral density of the
glottal source residual after removing the average
acoustic wave (mucosal wave correlate) shown in Fig.1.
Each singularity (peak or trough) is described by its
amplitude and position
relative to the largest peak $\{T_{M1}, f_{M1}\}$ in the spectral
distribution as

$$\tau_{m1} = T_{m1} - T_{M1}; \qquad \rho_{m1} = \frac{f_{m1}}{f_{M1}} \qquad (1)$$

$$\tau_{M2} = T_{M2} - T_{M1}; \qquad \rho_{M2} = \frac{f_{M2}}{f_{M1}} \qquad (2)$$

therefore implicitly $\tau_{M1}=0$ and $\rho_{M1}=1$. The definitions for
the first trough may be extended to any other in the
spectral profile (provided that these meet certain
conditions), assuming that each minimum at $f_{mk}$ follows a
maximum at $f_{Mk}<f_{mk}$ as given by


T =T y -T 
and correspondingly for the slenderness factor of the
trough 


Af mki- f) 


This last factor is strongly related to the tensions of the
springs linking the corresponding masses in the k-mass
equivalent biomechanical model originating the peaks
and troughs, and is a measure of the stress on the vocal fold
cover. The complete definition of the glottal signature is
the following

It is of the utmost importance to emphasize that the
parameterization scheme proposed is normalized both in
amplitude and frequency and is pitch-independent.
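
The normalization relative to the largest peak can be sketched in a few lines. The function name and the sample values are hypothetical, and the sign convention (amplitude of each singularity minus amplitude of the largest peak, in dB) is an assumption; only the idea of referring amplitudes and frequencies to the largest peak is taken from the text.

```python
def normalize_singularities(singularities):
    """Express (amplitude_dB, frequency_Hz) pairs relative to the largest peak.

    Amplitudes become differences in dB, frequencies become ratios, so the
    largest peak itself maps to (0.0, 1.0).
    """
    ref_amp, ref_freq = max(singularities, key=lambda s: s[0])
    return [(amp - ref_amp, freq / ref_freq) for amp, freq in singularities]


# Invented values: largest peak, first trough minimum, second peak.
profile = [(-10.0, 300.0), (-40.0, 750.0), (-25.0, 1200.0)]
signature = normalize_singularities(profile)

# The same profile with a 6 dB gain and all frequencies scaled by 1.5,
# mimicking a louder recording with a higher pitch.
shifted = normalize_singularities([(a + 6.0, f * 1.5) for a, f in profile])
```

Both `signature` and `shifted` contain the same normalized values, which is the sense in which such a parameterization is independent of overall level and of pitch.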


II. METHODS 


A database recorded within project MAPACI [5], which
is available to researchers upon
request, was used in the study. The database contains both normal and
pathologic cases assessed by video-endoscopy, EGG and
GRBAS evaluation. A first study on the distribution of
glottal spectral profiles by gender was carried out,
demonstrating that both genders are subject to different
dispersion profiles [6], mainly affecting specific
parameters. Based on these results a second study was
launched to determine which parameters played a more
important role in gender detection [2]. The results of this
study are summarized in Figure 2, showing that gender
may be blindly assessed from the spectral density of the
glottal source, p32 being among the most gender-sensitive 
parameters, together with ps, p3y and pj», in respective 
order from larger to smaller sensitivity. Although a wider
study on the statistical relevance of the parameter set is
still pending, this study helped in determining that
pathology studies should take into account the
gender-sensitivity of distortion parameters as well. It
therefore pointed to the need to
establish pattern comparison strategies within male or
female profiles according to the patient’s gender for
better detection and classification.

Figure 2. Example of unsupervised k-means clustering by
gender produced using the parameters related to the spectral
envelope energy decay (p32), and the slenderness of the first
two “V-troughs” (p33 and p34). A set of 100 equally balanced
normal speakers is separated as male (9) and female (0)
clusters. Male and female voice samples are clearly separated
by p32=80. The only two misclassified cases (male clustered
as female) are indicated by arrows (++141 and #/F3).
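
The unsupervised clustering step can be illustrated with a plain k-means (Lloyd's algorithm) sketch. It is not the authors' implementation: the data are synthetic stand-ins for the three parameters (p32, p33, p34), with invented cluster centres placed on either side of the p32 = 80 boundary mentioned in the caption, and the function name is hypothetical.

```python
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids


# Synthetic stand-in for (p32, p33, p34); all values are invented.
rng = random.Random(42)
group_a = [(rng.gauss(60, 3), rng.gauss(1.0, 0.1), rng.gauss(1.0, 0.1))
           for _ in range(50)]
group_b = [(rng.gauss(100, 3), rng.gauss(2.0, 0.1), rng.gauss(2.0, 0.1))
           for _ in range(50)]
labels, centroids = kmeans(group_a + group_b, k=2)
```

Because the parameters have very different ranges, the first coordinate dominates the Euclidean distance here; with real data the parameters would normally be standardized before clustering.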


Therefore another study was launched aiming to detect
pathology using clues present in the glottal signature of
female voice, as a first approach. To this end, a set of 24
pathologic cases was selected from the database of
female speakers, including 8 cases of Reinke’s edema, 8 of nodules
and 8 functional cases. A second set of 24 normal female
speakers was randomly selected from the normal database
to serve as control group. Classical distortion parameters
such as jitter, shimmer and HNR were mixed with the glottal
spectral profiles and dynamic estimates of the vocal fold
body and cover biomechanics (masses, losses and
tensions) from indirect power spectral density inversion
[1].


III. RESULTS


A general study of parameter relevance was conducted
using k-means clustering, PCA dimensional reduction,
and back-annotation. Sample utterances of vowel /a/
from the study subjects were processed to obtain their
glottal source and mucosal wave correlate. The glottal
signature described in the introduction, classical
distortion parameters {p1...p14} such as jitter, shimmer and
HNR, and biomechanical
parameters {p35...p45} associated with masses, stiffness and
losses for the vocal fold body and cover as described in
[7] were used to compose a 46-parameter vector
associated with each phonation cycle from the utterance
considered. Based on PCA techniques, a subset of the 16
most relevant parameters describing the statistical
dispersion of the samples was determined in order to select the
parameters that best discriminate normal from pathologic
phonation. Figure 4.a shows normal and pathologic
sample distributions in terms of the three parameters
scored among the most relevant ones from back-annotation:
jitter from classical distortion parameters (p2),
the depth of the second V-trough of the glottal signature
(p21) and the estimated losses in the vocal fold cover from
biomechanical parameters (p42). It must be mentioned
that this last parameter is indirectly connected with the
shimmer (p3). Clear pathologic cases are characterized
by a wide dispersion and large parameter values in this
3D subspace, whereas normal phonation shows small
parameter values and small dispersion, with mild pathologic
cases spreading in between normal and strong
pathological cases.


IV. A STUDY CASE 


To illustrate the use of this clustering technique a study 
case was carried out using data from a 34-year-old
female, non-smoker, theatre actress, reporting chronic 
dysphonia, vocal fatigue, changes in loudness and soaring 
during speaking or singing as a result of a polyp on the 
right vocal fold as shown in Figure 3. Data from this 
patient before (#0E8) and after surgery (#2DC) were 
introduced in the database as if produced by two different 
speakers for their comparison against normal and 
pathological cases as shown in Figure 4.b. 


Figure 3. Study case described in the related paper [8]. The 
left and right templates show images of pre- and post-surgery 
vocal folds in treating a gelatine-type polyp (indicated by the
arrow).


The outcome of the comparison is rather interesting.
The glottal signature corresponding to #0E8 (encircled in
full line), produced from pre-surgery data, was labelled by
the clustering methodology as a member of the subset of
mild pathological cases. After surgery the situation
changed substantially, as #2DC was allocated inside the
group of normal phonation subjects, encircled in dash-dot
line. The arrow shows the relative
transition of the subject’s condition from one case to the
other.

Figure 4. Pathologic vs normal clustering of female samples
using three distortion parameters (2: jitter, 21: second trough
minimum, 42: cover losses). a): overview of the general
clustering of normal and pathologic samples. Normal
phonation is clustered in the lower left-hand corner (dash-dot:
minimum jitter, depth and losses). b): close-up view of a
study case in the same 3D representation space from pre-surgery
(#0E8 encircled in full line) to post-surgery (#2DC in
dot line). The treatment results appear as a translation in the
representation space from the mild pathological to the normal
cluster.


IV. DISCUSSION 


The re-allocation of the same patient’s data after surgery
within the normal cluster can be explained by the strong
changes observed in the respective spectral signatures of
the glottal source as derived from pre- and post-surgery
voice records, given in Figure 5.a and b. It may be
appreciated there that the spectral contents of the glottal
source change drastically from the pre-surgery to the post-surgery
condition. In Figure 5.a the harmonic structure of the
glottal source between 1500 and 3200 Hz is almost
nonexistent, whereas it has been completely restored in
Figure 5.b. This improvement in voice quality justifies
the translation of the associated vector in the 3D
representation space of Figure 4 (bottom) from the pathologic
to the normal cluster.

Figure 5. Glottal Source Power Spectral Signature for a
pathological case. a) pre-surgery. b) post-surgery. Horizontal
axes given in Hz.


V. CONCLUSIONS 


Some interesting conclusions may be derived from the
work presented. First of all, it has been shown that certain
glottal signature parameters are gender-sensitive,
allowing unsupervised clustering of male and female
voice samples. This sensitivity may extend to a
larger or smaller extent to other distortion parameters. A
direct consequence of this finding is that any parameter
template involved in pattern recognition processes using
glottal behaviour may be coding not only pathology, but
also gender and other characteristics of the subject.
Therefore gender issues will have to be taken into
account as far as pathology detection (and possibly
classification) is concerned. With these facts in mind it
was possible to distribute template vectors corresponding
to glottal signature parameters of normal and pathological
cases within a representation subspace showing distinct
pattern distributions accordingly. Finally a study case
helped in testing the ability of the glottal signature to
represent dynamic changes in voice quality between pre- and
post-surgery. The clustering classified quite accurately
both situations as pertaining to pathologic or normal
cases, thus serving as a benchmark for assessing the
outcome of the surgical treatment of the pathology. This
may be of great interest to further improve pathology
detection methods.


Abstract: The approximate entropy of vowel
phonation spectra has been used to reveal two normal 
voicing groups in healthy males. An analysis of the 
corresponding open and closed quotients for the 
groups shows the first with balanced quotients and the 
second with asymmetric quotients characterised by 
pronounced vocal fold open phases. This indicates the 
second group is impacted by turbulent air flow, which 
reduces both the spectral structure in and 
approximate entropy of vowel phonation. The 
corresponding spectra are presented to confirm the 
effect. This provides a physiological explanation for 
the success of approximate entropy as a single figure 
of merit for voicing quality. 

Keywords: voicing, entropy, quotients, spectra, physiology


I. INTRODUCTION


Voicing quality in healthy male individuals has 
recently been quantified using a single figure of merit 
based on the complexity of extended vowel power spectra 
[1]. The power spectra were derived from stationarised 
electro-glottogram (EGG) measurements of sustained /i/
phonation and normalised to counteract the dynamic
characteristics of the fundamental frequency. In the form
of approximate entropy (ApEn), complexity analysis was
then able to divide cohorts of healthy males into two
statistically distinct power spectral groups, which were
labelled G1 and G2. The former had ‘bright’ spectral
characteristics with well defined peaks, whilst the latter
exhibited depressed or ‘dull’ spectral characteristics.
Fig.1 shows the dominant, bright G1 group had an ApEn
that is typically twice that found in the dull G2 group. In
subsequent work, the recovery of voicing quality in 
cancer patients following radiotherapy was studied using 
the characteristic ApEn values of the G1 and G2 groups 
as normal reference standards [2]. Many patients 
presenting with pathologically low ApEn values 
recovered voicing with normal G1/G2 ApEn levels one 
year after treatment. 

The EGG measurements underpinning these studies are
known to correlate well with the glottal waveform, which
in turn is a reflection of the physiological process of
vocal fold vibration [3]. This paper provides a clinical
explanation for the success of spectral domain complexity
analysis using spectral approximate entropy. It also
provides further evidence for the existence of the G1 and
G2 groups of voicing normality. The distinctive G1 and
G2 spectral characteristics are shown to be consistent
with the time the vocal folds spend in their open or closed
phases during phonation. Furthermore, it is shown that it
would not have been possible to separate, and therefore
establish, the normal G1 and G2 groupings purely from
the fundamental frequencies of vowel phonation.

Fig. 1
Dual Gaussian mixture maximum likelihood fit to healthy
male ApEn showing dual peaking (G1 and G2)
(P<0.001). Ordinate: probability density. Abscissa: ApEn.


II. METHODS 


A cohort of 85 healthy male volunteers was recruited 
through institutional advertising. An EGG was acquired 
for each subject using sensors attached across the thyroid 
cartilages. The sensors were connected to a PC controlled 
electro-laryngograph under the expert guidance of speech 
and language therapists. Each subject was asked to 
phonate the vowel /i/. The EGG impedance and acoustic 
signals were digitised at a sampling rate of 20 kHz, 16
bits per sample, for a total of approximately 4 seconds
recording time. The data files were transmitted by
network to a Pentium PC system for complexity analysis
using software written in the scientific language IDL.

The distribution of ApEn values was analysed for dual 
peaking using maximum likelihood [4]. Fundamental 
frequency (F0) and vocal fold open and closed quotients 
(OQ and CQ respectively) were recorded for each subject 
using the electro-laryngograph. 

Fig. 2 
Histogram of the number of subjects in G1 (ordinate) 
against time taken (in percentage) for larynx open and 
close phases (abscissa). Black shaded bars represent 
closed phase/quotient (CQ) and shaded bars represent 
opened phase/quotient (OQ). 


The average spectrum for each subject, normalised to 
counter the effects of changing fundamental frequency 
and transformed onto a harmonic scale (fundamental- 
harmonic normalization, FHN [1]), was used to form a 
single image line in the construction of G1 and G2 group 
FHN-spectrograms. 


III. RESULTS 


There were no F0 differences between the two normal 
male groups G1 and G2, which showed mean F0 values of 
124 Hz (+/- 28 Hz) and 122 Hz (+/- 29 Hz) respectively. 

Open and close phases/quotients for G1 are shown in 
Fig.2. Both phases have a relatively symmetrical 
distribution either side of the 50% mark, with the open 
quotient only marginally higher than the closed quotient. 
In contrast, the open and closed quotients for group G2 
shown in Fig.3 are clearly separated. The G2 open phase 
is clearly much longer than the closed phase. 

The G1 and G2 group FHN-spectrograms are shown in 
Fig.4, left and right respectively. These show the F0 
peaks of the subjects in each group aligned to form a 
distinct left hand column, followed by seven further 
harmonic columns extending to the right. The harmonic 
content of the G2 group is weak and lacking in detail 
away from F0. In contrast, the harmonic content of the 
G1 group maintains its strength across the harmonic 
range. This matches the observation that some of the 
volunteers had clear ‘bright’ voicing whilst others were 
less distinct or ‘dull’. 

The group FHN-spectrograms are consistent with a 
significantly increased open quotient for G2 and an 
increase in turbulent flow past the vocal folds. 

Fig. 3 
Histogram of the number of subjects in G2 (ordinate) against 
time taken (in percentage) for larynx open and close 
phases (abscissa). Black shaded bars represent closed 
phase/quotient (CQ) and shaded bars represent opened 
phase/quotient (OQ). 


Fig.4 
Left: G1 male spectrogram. Lines composed of FHN 
spectra for each subject. Ordinate: arbitrary subject. 
Abscissa: harmonic scale with fundamental peak forming 
the leftmost, bright vertical column. 


Right: G2 male spectrogram composed as for G1. Note 
the weakness of the harmonics evidenced by fading out as 
one progresses from left to right. 


IV. DISCUSSION 


Modern complexity analysis began in earnest in the 
1950s when Kolmogorov and Sinai developed their KS 
entropy statistic. Their aim was to assess non-linear 
dynamic systems whose complex behaviour showed 
changes from regular to irregular states. KS entropy is 
zero for regular systems and positive-finite for irregular 
systems. Irregular systems are often termed chaotic to 
distinguish them from ones that are random and 
characterised by infinite entropy [5]. However, the 
calculation of KS entropy for real-world signals was 
delayed, because it requires an impractically large 
amount of data. Subsequently, Pincus [6] developed a 
more pragmatic algorithm for calculating the entropy. 
Since it emerged from heuristic basics, the term 
‘approximate’ entropy, or ApEn, is now used to describe 
his measure. 

ApEn measures the degree of irregularity by measuring 
the frequency with which patterns of a given length 
appear in a data sequence. In a highly irregular sequence, 
each possible pattern appears with roughly equal 
frequency, and so gives rise to a high ApEn. In contrast, a 
highly regular sequence tends to contain a predominance 
of one or more patterns and a scarcity of other patterns, 
thus yielding a low ApEn. Finally, a uniform sequence 
contains just one pattern, the simplest being flat, which 
results in an effectively zero ApEn. The applicability to 
voicing data series, to distinguish regular, irregular and 
random structuring, benefitted from a move to the 
spectral domain to assess the entire spectral pattern rather 
than selected peaks [1]. 
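The pattern-counting idea described above can be illustrated with a minimal sketch of approximate entropy in Pincus' usual formulation. The function name, the template length m = 2 and the absolute tolerance r = 0.2 are assumptions made for illustration; they are not the settings of the spectral ApEn analysis reported here, and the input would be a spectrum rather than a time series in that setting.

```python
import math
import random


def approximate_entropy(series, m=2, r=0.2):
    """ApEn(m, r) of a numeric sequence, following Pincus' definition.

    m is the template (pattern) length and r is an absolute tolerance;
    both defaults are illustrative assumptions.
    """
    n = len(series)

    def phi(length):
        # All overlapping templates of the given length.
        templates = [series[i:i + length] for i in range(n - length + 1)]
        total = 0.0
        for a in templates:
            # Count templates within tolerance r (Chebyshev distance);
            # the self-match is included, so the count is never zero.
            matches = sum(
                1 for b in templates
                if max(abs(x - y) for x, y in zip(a, b)) <= r
            )
            total += math.log(matches / len(templates))
        return total / len(templates)

    return phi(m) - phi(m + 1)


# Three illustrative inputs: uniform, regular and irregular.
flat = [1.0] * 30
regular = [0.0, 1.0] * 30
rng = random.Random(0)
irregular = [rng.random() for _ in range(60)]

apen_flat = approximate_entropy(flat)
apen_regular = approximate_entropy(regular)
apen_irregular = approximate_entropy(irregular)
```

In this sketch the uniform sequence yields zero, the alternating sequence a value close to zero, and the irregular sequence a clearly larger value, which mirrors the three cases described in the text.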

The broad ApEn cases above correspond to the G1 and 
G2 group spectral features, where G2 is degraded by 
turbulent flow through vocal folds that are incompletely 
closing during phonation. Note here that the resultant 
turbulent flow will inevitably produce acoustic noise that 
is only random in nature in the time domain. A decrease 
rather than an increase in ApEn is seen for G2, since the 
ApEn analysis in this paper is performed in the inverse 
spectral domain, where noise acts to flatten features. 
Ultimately, for white noise in the time domain, a uniform 
spectrum in the spectral domain with zero ApEn would 
be the result. 

Hence the move from full vocal fold closure in the 
‘bright’ G1 normal group, to partial vocal fold closure in 
the ‘dull’ G2 group is effectively charting the decline of 
voicing to a pathological form, which is reported for 
radiotherapy patients in [2]. The OQ/CQ figures are 
physiological evidence that supports the discovery of G1 
and G2 groups, providing a clinical rationale for a 
discovery initially made using spectral ApEn as a single 
figure of structural merit. 


V. CONCLUSION 


Open/closed quotients for healthy males phonating the 
vowel /i/ show the existence of two groups. One has 
approximately equal quotients. The other has asymmetric 
quotients, indicating prolonged vocal fold opening, 
suggesting a consequent increase in turbulent air flow. 
The quotient groups correspond to the G1 and G2 groups 
identified earlier using spectral pattern approximate 
entropy analysis. Spectral approximate entropy analysis 
of voicing normality has been shown to be consistent 
with underlying physiological behaviour. 
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Abstract: In this paper an efficient fuzzy wavelet 
packet (WP) based feature extraction method has 
been used for the classification of normal voices and 
pathological voices of patients suffering from 
unilateral vocal fold paralysis (UVFP). The tenth order 
Daubechies mother wavelet function (db10) has 
been employed to decompose signals into 5 levels. Next, 
WP coefficients have been used to measure energy 
and Shannon entropy features at different spectral 
sub-bands. Consequently, to find discriminant 
features, signals have been clustered in 2 classes using 
fuzzy c-means method. The amount of fuzzy 
membership of pathological and normal signals in 
their corresponding clusters is considered as a 
measure to quantify the discrimination ability of 
features. Thus, considering this measure, an optimal 
feature vector of length 8 has been chosen to 
discriminate pathological voices from normal ones. 
Feature vector obtained by considering nodes’ 
discriminant ability with classification percentage of 
100 has a better performance in comparison with the 
feature vector including equal portion of nodes for the 
features of energy and entropy with the approximate 
classification percentage of 96. The simulation results 
show that fuzzy WP based feature extraction is an 
effective tool in voice signal analysis. 


Keywords: Voice disorders, feature extraction, wavelet 
packets, fuzzy sets 


I. INTRODUCTION 


Unilateral vocal fold paralysis (UVFP) results from a 
dysfunction of the recurrent or vagus nerve innervating 
the larynx and causes a characteristic breathy voice. 
UVFP most commonly occurs following a surgical 
iatrogenic injury to the vagus or recurrent laryngeal nerve 
resulting in glottal incompetence, either partial or 
complete, because of the poor or reduced vocal fold 
closure. 

Physiological alterations of the vocal cords cause 
unhealthy vibration patterns and a decrease in the 
quality of the patient’s speech signal; such conditions are known as voice 
pathologies. Therefore, the detection of incipient 
damage to the cords is useful in improving the 
prognosis, treatment and care of such pathologies. 
Physicians often use invasive techniques like endoscopy 
to diagnose symptoms of voice disorders. It is, however, 
possible to identify disorders using certain features of 
speech signal in a non-invasive way [1]. Schuck et al [2] 
have used Shannon entropy and energy features of 
wavelet packet decomposition and the best basis 
algorithm for normal/pathological speech signal 
classification. Fonseca et al [3] have employed mean 
square values of reconstructed signals in discrete wavelet 
transform sub-bands and least square support vector 
machine (LS-SVM) classifier for identification of signals 
from patients with vocal fold nodules and normal signals. 
Guido et al [4] have tried different wavelets in the search 
for voice disorders. The Daubechies mother wavelet with 
support length of 20 (db10) was found to be the best wavelet 
for speech signal analysis among commonly used 
wavelets. Behroozmand et al [5] have used a genetic 
algorithm for optimal selection of wavelet packet based 
energy and Shannon entropy features for identification of 
patients’ speech signal with unilateral vocal fold paralysis 
(UVFP). The results showed that the decomposition level 
of five is the best level to analyze pathological speech 
signals. Local discriminant bases (LDB) and wavelet 
packet decomposition have been used to demonstrate the 
significance of identifying discriminant WP subspaces in 
a work by Umapathy et al [6]. 

A fuzzy wavelet packet based feature extraction method 
has been proposed by Li et al and has been applied to 
biological signal classification [7]. In contrast to the 
standard methods of feature extraction used in WPs, this 
method of discriminatory feature extraction from wavelet 
packet coefficients is based on the fuzzy set criterion. 
Yang et al [8] have applied the fuzzy wavelet packet method 
to feature extraction from electroencephalogram (EEG) 
signals. The results show that this method is promising 
for feature extraction from EEG signals in brain-computer 
interfaces (BCIs). 

This work aims to identify patients with UVFP by 
extracting an effective feature vector that contains fewer 
features and offers higher discrimination accuracy 
and lower computational complexity (e.g. in 
comparison with the one obtained by genetic algorithm 
based optimal feature selection [5]). It is based on the wavelet packet 
transform (WPT), fuzzy sets, and an artificial neural network 
(ANN) classifier. 
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II. METHODS 
A. Wavelet Packet Transform 


Recently, wavelet packets (WPs) have been widely 
used by many researchers to analyze voice and speech 
signals. There are many outstanding properties of wavelet 
packets, which encourage researchers to employ them in 
many widespread fields. It has been shown that sparsity 
of coefficients’ matrix, computational efficiency, and 
time-frequency analysis can be useful in dealing with 
many engineering problems. Most importantly, the 
multiresolution property of WPs is helpful in voice signal 
synthesis. 

The hierarchical WP transform uses a family of 
wavelet functions and their associated scaling functions 
to decompose the original signal into subsequent sub-bands. 
The decomposition process is recursively applied 
to both the low and high frequency sub-bands to generate 
the next level of the hierarchy. WPs can be described by 
the following collection of basis functions: 


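$$W_{2n}(2^{p-1}x - l) = \sqrt{2^{1-p}} \sum_{m} h(m - 2l)\, \sqrt{2^{p}}\, W_{n}(2^{p}x - m) \tag{1}$$

$$W_{2n+1}(2^{p-1}x - l) = \sqrt{2^{1-p}} \sum_{m} g(m - 2l)\, \sqrt{2^{p}}\, W_{n}(2^{p}x - m) \tag{2}$$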


where p is the scale index, l the translation index, h the low-pass 
filter and g the high-pass filter with 


$$g(k) = (-1)^{k}\, h(1 - k) \tag{3}$$


The WP coefficients at different scales and positions 
of a discrete signal can be computed as follows: 


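$$C^{p}_{n,l} = \sqrt{2^{p}} \sum_{m} f(m)\, W_{n}(2^{p}m - l) \tag{4}$$

$$C^{p-1}_{2n,l} = \sum_{m} h(m - 2l)\, C^{p}_{n,m} \tag{5}$$

$$C^{p-1}_{2n+1,l} = \sum_{m} g(m - 2l)\, C^{p}_{n,m} \tag{6}$$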


For a particular sequence of wavelet packet 
coefficients, energy in its corresponding sub-band can be 
computed as: 


$$\mathrm{Energy}_{n} = \sum_{k=1}^{n} \left| C^{p}_{n,k} \right|^{2} \tag{7}$$


The Shannon entropy as another extracted feature for 
classification of signals can be computed through the 
following formula: 


$$\mathrm{Entropy}_{n} = -\sum_{k=1}^{n} \left| C^{p}_{n,k} \right|^{2} \log \left| C^{p}_{n,k} \right|^{2} \tag{8}$$
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Due to the noise-like effect of irregularities in the 
vibration pattern of damaged vocal folds, the distribution 
manner of such variations within the whole frequency 
range of pathological speech signals is not clearly known. 
Therefore, it seems reasonable to use WP rather than 
the discrete wavelet transform (DWT) to obtain more detailed 
sub-bands. 
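The sub-band energy and Shannon entropy features described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: the two-tap Haar filter pair is swapped in for the db10 wavelet used in the paper so that the example needs no external library, the entropy is taken in the usual non-normalized form (minus the sum of squared coefficients times their logarithm), and all function names are hypothetical.

```python
import math


def haar_split(x):
    """One analysis step on an even-length list: (low-pass, high-pass).

    Haar filters are an assumption for illustration; the paper uses db10.
    """
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high


def wavelet_packet_nodes(signal, levels):
    """Full wavelet packet tree: BOTH bands are split again at every level."""
    nodes = [list(signal)]
    for _ in range(levels):
        next_nodes = []
        for band in nodes:
            low, high = haar_split(band)
            next_nodes.extend([low, high])
        nodes = next_nodes
    return nodes  # 2**levels sub-bands at the deepest level


def node_energy(coeffs):
    """Sum of squared coefficients in one sub-band."""
    return sum(c * c for c in coeffs)


def node_shannon_entropy(coeffs):
    """Non-normalized Shannon entropy of one sub-band."""
    total = 0.0
    for c in coeffs:
        p = c * c
        if p > 0.0:  # convention: 0 * log(0) = 0
            total -= p * math.log(p)
    return total


# Toy signal: 64 samples of a sinusoid, decomposed to 3 levels (8 sub-bands).
signal = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
nodes = wavelet_packet_nodes(signal, levels=3)
features = [(node_energy(c), node_shannon_entropy(c)) for c in nodes]
```

Because the Haar step is orthonormal, the sub-band energies at any level add up to the energy of the original signal, which is a convenient sanity check on an implementation. Unlike the DWT, every band is split again, so high frequencies receive the same resolution as low frequencies.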


B. Fuzzy Set-Based Feature Selection Criterion 


With fuzzy sets we allow any pattern $x_k$ to belong to 
several classes to varying degrees. Denoting by $u_{ik}$ the 
membership grade of pattern $x_k$ in class i, we have: 


$$u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{\| x_k - v_i \|}{\| x_k - v_j \|} \right)^{2/(b-1)} \right]^{-1} \tag{9}$$


where c is the number of clusters, $v_i = \sum_{k \in A_i} x_k / N_i$ is the 
mean of class i, $A_i$ is the set of indexes of the training 
patterns belonging to class i, $N_i$ is the number of class i 
training patterns, $\| \cdot \|$ is the Euclidean distance and $b > 1$ 
is the fuzzification factor that modifies the shape of 
membership grades. For the labeled training patterns in 
feature space X, we define a criterion $F(X) \in (0, N]$, based 
on the membership function, to evaluate the 
classification ability of X as follows: 


$$F(X) = \sum_{i=1}^{c} \sum_{k \in A_i} u_{ik} \tag{10}$$


The larger the values of F(X), the higher the 
classification (discrimination) abilities of the feature 
space X. 
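As a rough illustration of the criterion, the sketch below computes class means, membership grades and F(X) for labeled patterns. It assumes the standard fuzzy c-means form of the membership grade and a fuzzification factor b = 2; the function names and the toy data are hypothetical and are not taken from the paper.

```python
import math


def class_means(patterns, labels, n_classes):
    """Mean vector of the training patterns of each class."""
    dim = len(patterns[0])
    means = []
    for i in range(n_classes):
        members = [x for x, y in zip(patterns, labels) if y == i]
        means.append([sum(x[d] for x in members) / len(members)
                      for d in range(dim)])
    return means


def membership(x, means, i, b=2.0):
    """Membership grade of pattern x in class i (fuzzy c-means form, b > 1)."""
    d_i = math.dist(x, means[i])
    if d_i == 0.0:
        return 1.0  # the pattern coincides with the mean of class i
    total = 0.0
    for v_j in means:
        d_j = math.dist(x, v_j)
        if d_j == 0.0:
            return 0.0  # the pattern coincides with another class mean
        total += (d_i / d_j) ** (2.0 / (b - 1.0))
    return 1.0 / total


def criterion(patterns, labels, n_classes, b=2.0):
    """F(X): sum of each pattern's membership grade in its OWN class."""
    means = class_means(patterns, labels, n_classes)
    return sum(membership(x, means, y, b) for x, y in zip(patterns, labels))


# Toy one-dimensional features: one separates the classes well, one does not.
labels = [0, 0, 1, 1]
well_separated = [[0.0], [0.1], [5.0], [5.1]]
overlapping = [[0.0], [5.0], [0.1], [5.1]]

f_good = criterion(well_separated, labels, 2)
f_poor = criterion(overlapping, labels, 2)
ability_good = 100.0 * f_good / len(labels)  # percentage form of F(X)
```

For the well-separated feature, F(X) approaches the number of patterns, whereas for the overlapping feature it stays near half of that, which is the behaviour the criterion relies on when ranking nodes.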

In fuzzy set based optimal WP decomposition, a full 
WP decomposition to a maximum level of five has been 
performed for each labeled original signal. The mother 
wavelet function is chosen to be the tenth order 
Daubechies (db10). Subsequently, the features (i.e. energy 
and entropy) of all signals in each node have been 
clustered using the fuzzy c-means (FCM) method. The most 
discriminant nodes have been identified according to the 
parameter F(X), and the signals’ energy and entropy in 
those nodes have been used to construct the feature vector 
applied to the artificial neural network (ANN) classifier. 


C. Database 


Used in this study are sustained vowel phonation samples 
from subjects from the Kay Elemetrics Disordered Voice 
Database [9]. Subjects were asked to sustain the vowel /a/ 
and voice recordings were made in a sound proof booth 
on a DAT recorder at a sampling frequency of 44.1 kHz. 
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III. RESULTS AND DISCUSSION 


Having signals decomposed by mother wavelet of 
tenth order Daubechies to 5 levels of decomposition and 
having on hand energy and Shannon entropy at each 
decomposition sub-band, fuzzy logic based feature 
extraction method has been applied to construct an 
optimal feature vector of length 8 according to the nodes’ 
discrimination ability, which can separate normal and 
pathological (UVFP) voice signals. 

Table 1 shows the most discriminant nodes in terms 
of energy or entropy feature, with their discrimination 
abilities (F(X) / number of data × 100), which are 
obtained from the fuzzy clustering method. 

Feature vectors of length 8 have been extracted from 
the data in two ways: 1) with equal portions of discriminant energy and 
entropy nodes, and 2) according to the best discriminant 
nodes in terms of energy or entropy. 
Approximately 65 percent of the data has been used as the 
training data set, and the remaining 35 percent has been set 
aside as test and validation data, for a feedforward 
backpropagation multilayer neural network classifier with 
3 hidden layers. 

Fig. 1 shows the wavelet packet tree and the 
participating nodes in the feature vector, which are selected 
according to their discriminant ability. As can be seen, the 
selected sub-bands are distributed over the whole 
available frequency range, which shows that 
pathological factors do not influence specific frequencies. 
This accentuates the role of WP decomposition with 
equal decomposition of both high and low frequencies. 

As a case in point, the coefficients’ energies of 
decomposed voice signals in the most discriminant node 
(31) are illustrated in Fig. 2. The efficiency and 
discrimination ability of the selected node are evident. 


TABLE 1 
PARTICIPATING NODES AND THEIR DISCRIMINATION 
ABILITY 
Node Energy Entropy Discrimination ability (%) 
31 * 76.22 
34 * 73.32 
25 * 67.17 
29 * 66.95 
28 * 66.61 
37 * 66.82 
38 * 65.85 
17 * 62.71 
32 * 61.50 
5 * 61.27 


Fig. 1. The most discriminant nodes in terms of signal energy or entropy 
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Fig. 2. The discrimination ability of the node (31) 


The simulation results show that fuzzy wavelet 
packet based feature extraction method and neural 
network classifiers are effective tools in voice signal 
analysis. Moreover, feature vector obtained considering 
nodes’ discriminant ability with classification 
percentage of 100 has a better performance in 
comparison with the feature vector including equal 
portion of nodes for the features of energy and entropy 
with the approximate classification percentage of 96. 


IV. CONCLUSION 


In this study, classification of voice signals into two 
groups of normal and patients with unilateral vocal fold 
paralysis (UVFP) has been presented. Fuzzy wavelet 
packet based feature extraction method has been 
utilized to find the optimal feature vector of length 8 
from energy and Shannon entropy features in WP 
decomposition sub-bands. In the following, the obtained 
feature vector has been passed on to a neural network 
(NN) classifier. The simulation results show that the 
fuzzy wavelet packet based selected optimal feature 
vector of length 8 applied to a NN classifier can achieve 
a classification accuracy of 100 percent, which despite 
its relatively short length, outperforms feature vectors 
obtained by other methods. 
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Abstract— The effectiveness of ten different feature 
sets in classification of voice recordings of the sustained 
phonation of the vowel sound /a/ into healthy and pathological 
classes is investigated, and a new approach to 
building a sequential committee of support vector machines 
(SVM) for the classification is proposed. The 
optimal values of hyper-parameters of the committee 
and the feature sets providing the best performance are 
found during the genetic search. In the experimental 
investigations performed using 444 voice recordings of 
the sustained phonation of the vowel sound /a/ coming 
from 148 subjects, three recordings from each subject, 
the correct classification rate of over 92% was obtained. 
The classification accuracy has been compared with the 
accuracy obtained from four human experts. 


Keywords— voice pathology; feature selection; genetic 
search; support vector machine 


I. INTRODUCTION 


Automated acoustic analysis of voice is increasingly 
used for detecting laryngeal pathologies [1], [2], [3]. 
Time, frequency, and cepstral domains are usually used 
to extract features characterizing a voice signal. Analy- 
sis of the literature related to automated categorization 
of voice aiming to detect laryngeal pathologies shows 
that the categorization is usually based on one, two or 
three types of features. There are no works attempting 
to extract a larger variety of features for characterizing 
a voice signal. 

Various classifiers were used to make a decision about 
a voice signal represented by a feature vector. Gaus- 
sian mixture models [4], [5], the linear discriminant [2], 
k-NN [1], LVQ [8], hidden Markov models [6], a mul- 
tilayer perceptron [7], and radial basis function net- 
works are the most popular classifiers applied. In 
most of the studies, a two-class classification problem 
is solved, namely, a voice signal is assigned into 
a healthy or pathological class. The correct classifica- 
tion rate obtained in different studies, when solving the 
two-class classification problem, varies in a broad range: 
85.8% [8], 89.1% [2], 91.3% [7], 96% [3]. Due to a large 
variety of data sets used in the different studies, com- 
parison of the results obtained in the studies is rather 
problematic. 

This paper focuses on investigation of usefulness of a 
large variety of feature types in the laryngeal diagnos- 
tics task of categorizing a voice signal into the healthy 
and pathological classes. A committee of support vector 
machines (SVM) [9] is used to make the categorization. 
To find the optimal values of hyper-parameters of the 
classifier and the optimal feature subsets of the various 
types, a genetic search procedure is applied. The experimental 
investigations performed have shown that the 
techniques developed significantly improve 
the classification accuracy compared to the case of 
using the best feature set of a single type. 


II. FEATURE SETS AND FEATURE SELECTION 


Growing size of data sets in terms of features in- 
creases the variety of problems characterized by multi- 
ple feature sets. Voice characterization is also the case. 
In this study, we used ten different feature sets [10] (the 
number of features available is shown in parentheses): 
1. pitch and amplitude perturbation measures (24); 
2. frequency features (100); 
3. mel-frequency features (35); 
4. cepstral energy features (100); 
5. mel-frequency cepstral coefficients (35); 
6. autocorrelation features (80); 
7. harmonics to noise ratio in spectral domain (11); 
8. harmonics to noise ratio in cepstral domain (11); 
9. linear prediction coefficients (16); 
10. linear prediction cosine transform coefficients (16). 


It is well known that not all features are useful for 
classification. Some of them can even deteriorate the 
classification accuracy. Despite the large variety 
of techniques available for selecting variables for a single 
classifier, works on feature selection for classification 
or regression committees are not so numerous [11], 
[12]. It has been demonstrated that even simple random 
sampling in the feature space may be an effective 
technique for increasing the accuracy of classification 
committees [13]. In [14], [15], genetic algorithms have 
been used for ensemble feature selection, probably for 
the first time, by exploring all possible feature subsets. 
However, only one ensemble was considered in these 
works. Kim et al. proposed meta-evolutionary ensem- 
bles considering multiple ensembles simultaneously [16]. 

When solving multiple feature sets based classifica- 
tion or prediction tasks, it is desired to exploit all the 
information available with reasonable resources. Clas- 
sification or prediction based on ensemble aggregating 
members trained on different feature sets into a parallel 
structure is the usual way to solve such tasks. Studying 
the results obtained by the different authors regarding 
variable selection for ensembles, it seems that the ge- 
netic search where a chromosome encodes an ensem- 
ble is the most promising approach. However, pure 
genetic search based approaches are computationally 
prohibitive for large sets of variables, which is almost 
always the case with multiple feature sets. In this work, 
to mitigate the computational burden problem, a two 
stage ensemble generation procedure is developed. 


III. PROCEDURE 


Given a database consisting of L feature sets characterizing 
Q classes, the procedure to generate an ensemble 
for data classification into Q classes is summarized 
in the following steps. To obtain Q-class classification, 
Q(Q - 1)/2 one-against-one classifiers are designed. 
When Q is large, the one-against-all scheme can 
be applied. 
1. Design an SVM using features of the jth type for 
separating data coming from the ith pair of classes. 
Use the genetic search procedure for the design. The 
design results in optimal hyper-parameter values 
and the optimal feature set $F_{ij}$ consisting of $N_{ij}$ 
features. 

2. Randomly generate K-1 additional sets of features 
of size $N_{ij}$. Using the feature sets, 
train K-1 SVM classifiers to separate the ith pair 
of classes. 

3. Present the training data to all the K classifiers, 
calculate the outputs and convert them into posterior 
probabilities. These probabilities will be used 
as features in the second stage. 

4. Repeat Steps 1 to 3 for all the feature types, j = 
1, ..., L. 

5. Repeat Steps 1 to 4 for all the pairs of classes, i = 
1, ..., Q(Q - 1)/2. 


6. Using the probabilities as input features, design a 
new SVM for separating the ith pair of classes as 
described in Step 1. The probabilities used are those 
derived from outputs of the classifiers designed for 
separating at least one class of the ith pair. The 
number of input features is equal to (2Q - 3)*K*L. 

7. Repeat Step 6 for all the Q(Q - 1)/2 pairs of classes. 

8. The committee decision is obtained by aggregating 
decisions obtained from the Q(Q - 1)/2 SVMs. 
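The bookkeeping behind the count of second-stage inputs can be checked with a short sketch. It enumerates the one-against-one class pairs and, for each pair, counts the first-stage classifiers that involve at least one class of that pair. The function name is hypothetical, and the example values of Q, K and L are chosen only for illustration.

```python
from itertools import combinations


def first_stage_inputs(Q, K, L):
    """Return the class pairs and the number of second-stage inputs per pair.

    Q: number of classes, K: classifiers per feature type and pair,
    L: number of feature types.
    """
    pairs = list(combinations(range(Q), 2))  # Q(Q - 1)/2 pairs
    counts = {}
    for pair in pairs:
        # Pairs sharing at least one class with the current pair
        # (this includes the pair itself).
        related = [p for p in pairs if set(p) & set(pair)]
        counts[pair] = len(related) * K * L
    return pairs, counts


pairs, counts = first_stage_inputs(Q=4, K=2, L=10)
```

Each class of a pair occurs in Q - 1 pairs and the pair itself is counted twice, which gives 2(Q - 1) - 1 = 2Q - 3 related pairs. For the two-class problem this reduces to a single pair, so the number of inputs is simply K*L.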

The rationale behind the use of the random feature 
sets is to increase diversity of information conveyed 
from the first stage. Each SVM of the first stage gener- 
ates one feature for the next stage. Some of the features 
may be redundant. However, since the genetic search is 
applied also in the next stage, redundant features are 
eliminated during the search. 


A. Genetic search 


Information representation in a chromosome, gen- 
eration of initial population, evaluation of population 
members, selection, crossover, mutation, and reproduc- 
tion are the issues to consider when designing a genetic 
search algorithm. 

A chromosome contains all the information needed 
to build an SVM classifier. We divide the chromosome 
into three parts. One part encodes the regularization 
constant C, one the kernel width parameter σ, and 
the third one encodes the inclusion/noninclusion of features. 
The binary encoding scheme has been adopted 
in this work. 

To generate the initial population, the features are 
masked randomly and values of the parameters C and 
σ are chosen randomly from the intervals $[C_0 - \Delta C, C_0 + \Delta C]$ 
and $[\sigma_0 - \Delta\sigma, \sigma_0 + \Delta\sigma]$, respectively, where $C_0$ and 
$\sigma_0$ are the very approximate parameter values obtained 
from the experiment. 

The fitness function used to evaluate the chromo- 
somes is given by the correct classification rate of the 
validation set data. 

The selection process of a new population is gov- 
erned by the fitness values. A chromosome exhibiting a 
higher fitness value has a higher chance to be included 
in the new population. The selection probability of the 
ith chromosome $p_i$ is given by 


$$p_i = \frac{r_i}{\sum_{j=1}^{M} r_j} \tag{1}$$


where $r_i$ is the correct classification rate obtained from 
the classifier encoded in the ith chromosome and M is 
the population size. 
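Fitness-proportional selection of this kind can be sketched in a few lines. The sketch assumes that the selection probability of a chromosome is its correct classification rate divided by the sum of the rates over the population; the function names and the example rates are hypothetical.

```python
import random


def selection_probabilities(rates):
    """Selection probability of each chromosome: its rate over the total."""
    total = sum(rates)
    return [r / total for r in rates]


def select(population, rates, rng):
    """Draw a new population of the same size, with replacement."""
    weights = selection_probabilities(rates)
    return rng.choices(population, weights=weights, k=len(population))


# Toy population of three chromosomes with made-up classification rates.
rates = [80.0, 90.0, 30.0]
probs = selection_probabilities(rates)
new_population = select(["a", "b", "c"], rates, random.Random(1))
```

Because sampling is done with replacement, a chromosome with a high rate may appear several times in the new population, while a weak one may disappear.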

The crossover operation for two selected chromosomes 
is executed with the probability of crossover $p_c$. 
If a generated random number from the interval [0,1] is 
larger than the crossover probability $p_c$, the crossover 
operation is executed. Crossover is performed separately 
in each part of a chromosome. The crossover 
point is randomly selected in the “feature mask” part 
and two parameter parts and the corresponding parts 
of two chromosomes selected for the crossover operation 
are exchanged at the selected points. 
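A minimal sketch of part-wise one-point crossover is given below. The representation of a chromosome as a dictionary of three bit lists (for C, σ and the feature mask) is an assumption made for illustration, as are the function names; the trigger condition follows the wording of the text, where crossover is executed when the random draw exceeds the crossover probability.

```python
import random


def one_point_crossover(a, b, rng):
    """Exchange the tails of two equal-length bit lists at a random point."""
    point = rng.randrange(1, len(a))  # 1 .. len(a) - 1
    return a[:point] + b[point:], b[:point] + a[point:]


def crossover(parent1, parent2, rng):
    """Apply one-point crossover separately to each chromosome part."""
    child1, child2 = {}, {}
    for part in ("C", "sigma", "mask"):
        child1[part], child2[part] = one_point_crossover(
            parent1[part], parent2[part], rng)
    return child1, child2


def maybe_crossover(parent1, parent2, p_c, rng):
    """Cross over only when the random draw exceeds p_c (rule as worded)."""
    if rng.random() > p_c:
        return crossover(parent1, parent2, rng)
    return dict(parent1), dict(parent2)


# Hypothetical parents: 4 bits per parameter and a 6-bit feature mask.
p1 = {"C": [0, 0, 0, 0], "sigma": [1, 1, 1, 1], "mask": [1, 0, 1, 0, 1, 0]}
p2 = {"C": [1, 1, 1, 1], "sigma": [0, 0, 0, 0], "mask": [0, 1, 0, 1, 0, 1]}
c1, c2 = crossover(p1, p2, random.Random(3))
```

Keeping the crossover points separate for the three parts means that a parameter value and the feature mask are recombined independently of each other.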

The mutation operation adopted is such that each 
gene is selected for mutation with the probability $p_m$. 
The mutation operation is executed independently in 
each chromosome part. If the gene selected for mutation 
is in the feature part of the chromosome, the value 
of the bit representing the feature in the feature mask 
(0 or 1) is reversed. To execute mutation in the parameter 
part of the chromosome, the value of the offspring 
parameter determined by the selected gene is mutated 
by $\pm\Delta\gamma$, where $\gamma$ stands for C or σ, as the case may be. 
The mutation sign is determined by the fitness values 
of the two chromosomes, namely the sign resulting in 
a higher fitness value is chosen. The way of determining 
the mutation amplitude $\Delta\gamma$ is somewhat similar to 
that used in [17] and is given by 


$$\Delta\gamma = w \beta \max\left( |\gamma - \gamma_{p1}|, |\gamma - \gamma_{p2}| \right) \tag{2}$$


where $\gamma$ is the actual parameter value of the offspring, 
p1 and p2 stand for parents, $\beta \in [0,1]$ is a random 
number, and w is the weight decaying with the iteration 
number: 


w = k(1 — t/T) (3) 


where t is the iteration number, k is a constant, and T 
is the total number of iterations. 
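The decaying mutation amplitude can be sketched as follows, assuming that the amplitude is the product of the weight w, a random number in [0, 1] and the larger of the two distances between the offspring value and its parents' values. The function names, the constant k = 1 and the numbers in the example are hypothetical.

```python
import random


def decay_weight(t, T, k=1.0):
    """Weight w = k (1 - t/T), shrinking linearly with the iteration t."""
    return k * (1.0 - t / T)


def mutation_amplitude(value, parent1, parent2, t, T, rng, k=1.0):
    """Amplitude of a parameter mutation for an offspring value."""
    beta = rng.random()  # random number in [0, 1)
    spread = max(abs(value - parent1), abs(value - parent2))
    return decay_weight(t, T, k) * beta * spread


# Offspring value 5.0 with parents 3.0 and 9.0, early in a 100-step search.
rng = random.Random(0)
amplitude = mutation_amplitude(5.0, 3.0, 9.0, t=10, T=100, rng=rng)
```

Early in the search the weight is close to k, so mutations can move a parameter by up to the full distance to the farther parent; towards the last iteration the amplitude shrinks to zero, so the search becomes local.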

In the reproduction process, the newly generated 
offspring replaces the chromosome with the smallest fitness 
value in the current population, if a generated random 
number from the interval [0,1] is larger than the 
reproduction probability $p_r$ or if the fitness value of the 
offspring is larger than that of the chromosome with 
the smallest fitness value. 


IV. EXPERIMENTAL INVESTIGATIONS 


A. Data used 


The mixed gender local database we used in this 
study contains 444 voice recordings of the sustained 
phonation of the vowel sound /a/ (as in the English 
word “large”). The database built by the Department 
of Otolaryngology of the University hospital of Kaunas 
University of Medicine, Lithuania, is continuously updated 
by appending new recordings. The voice recordings 
come from 148 subjects, three recordings from each 
subject. Three separate voice samples were recorded 
in a sound-proof booth on a digitized Sony Mini Disc 
Recorder MDS-101 through a D60S Dynamic Vocal 
(AKG Acoustics) microphone placed at 10.0 cm dis- 
tance from the mouth. There are 79 subjects repre- 
senting the pathological and 69 the healthy class. The 
average length of each recording is 2.4 s. The recordings 
are made in the “wav” file format at a rate of 44100 samples per 
second. There are 16 bits allocated for one sample. 
During preprocessing, the beginning and the end 
of each recording were eliminated. 


B. Results 


Since we have a relatively small data set, the leave- 
one-out approach has been used in the tests. In the 
first set of experiments, a single classifier (SVM) has 
been used for each type of features. The optimal hyper- 
parameters of the classifier and the optimal feature set 
have been found using the genetic search procedure. 
Table I presents the results obtained from the tests. 
Apart from the correct classification rate 
(CCR), the table also presents the initial (N) and the 
selected (Ns) number of features. As can be seen from 
Table I, the HNR_cepstral, Mel_coefficients, and Perturbation 
features provided the best performance. 


TABLE I 
THE CORRECT CLASSIFICATION RATE (CCR), THE INITIAL (N), 
AND THE SELECTED NUMBER OF FEATURES (Ns) OBTAINED USING 
A SINGLE CLASSIFIER FOR EACH TYPE OF FEATURES. 


No.  Type of features    N     CCR %   Ns 

1    Perturbation        24    86.22   11 
2    Frequency           100   84.22   50 
3    Mel_frequency       35    84.44   15 
4    Cepstrum            100   83.11   52 
5    Mel_coefficients    35    87.33   19 
6    Autocorrelation     80    81.78   41 
7    HNR_spectral        11    82.44   4 
8    HNR_cepstral        11    87.78   4 
9    LP_coefficients     16    79.33   8 
10   LPCT_coefficients   16    80.67   6 


In the next set of experiments, a committee was built 
according to the proposed design procedure. Three 
versions of committees, with K equal to 0, 1, and 2 were 
explored. Since we have 10 different feature types, the 
number of available input features N for the committee 
is 10, 20, and 30, depending on the K value used. The 
results of the tests are summarized in Table II. 
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TABLE II 
THE CORRECT CLASSIFICATION RATE (CCR), THE INITIAL (N), 
AND SELECTED NUMBER OF FEATURES (Vs) OBTAINED USING THE 
CLASSIFICATION COMMITTEE. 


K N CCR % Ns 
0 10 91.01 6 
1 20 92.00 8 
2 30 92.56 13 


As can be seen from Table II, a considerable improvement
in classification accuracy is obtained when using
committees. The results indicate that the randomly
selected feature sets contribute to the increase in
classification accuracy. For example, the committee made
using K = 1 selects 8 features from the 20 available.
Amongst those eight, four features (HNR_spectral,
LP_coefficients, Mel_frequency, and Frequency) were
obtained using the original feature sets and four using
the randomly generated feature sets.

Four experienced clinical voice specialists, serving as
experts, performed a perceptual "blind" evaluation of the
same digitized recordings of the sustained vowel /a/,
classifying them into the "healthy" and "pathological"
classes without any additional information about the
subjects' age, gender, diagnosis, etc. All 444 recordings
were presented to the experts in a mixed and randomized
order. The correct classification rates obtained by the
experts were 77.70, 79.05, 79.73, and 73.20%, with mean
77.42% and standard deviation 2.94. Thus, when only a
sustained vowel /a/ is used as the information source, the
automatic system is considerably more accurate than the
experts.
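
The summary statistics quoted for the experts can be re-derived from the four individual rates. The short check below assumes the reported standard deviation is the sample (n - 1) estimate, which is what reproduces 2.94; the variable names are illustrative.

```python
import statistics

expert_ccr = [77.70, 79.05, 79.73, 73.20]   # expert rates quoted in the text (%)
mean_ccr = statistics.mean(expert_ccr)
std_ccr = statistics.stdev(expert_ccr)       # sample standard deviation (n - 1)

committee_ccr = 92.56                        # best committee result in Table II (K = 2)
gap = committee_ccr - mean_ccr               # advantage of the committee, in percentage points
```

The gap between the best committee and the average expert is roughly 15 percentage points, about five expert standard deviations.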


V. CONCLUSIONS 


A new approach to building a sequential committee
of support vector machines (SVM) for multiple feature
sets and genetic search based discrimination of
pathological voices was presented. The proposed approach
mitigates the computational burden characteristic of
genetic search procedures exploring high-dimensional
spaces. A considerable improvement in correct
classification rate was obtained from the committee
compared with the classifiers based on a single feature
type. Given the same information, the automated voice
discrimination procedure was considerably more accurate
than the human experts.
Abstract: We performed an international trial with a
newly developed multidimensional assessment
protocol for substitution voices (SV), based on the
European Laryngological Society protocol for
‘common dysphonia’. Because sound production
in SV is largely irregular, some adaptations were
needed, in particular of the dimensions
perception, acoustic analysis and visual evaluation.
The protocol consisted of clinical information, the
IINFVo perceptual rating scale, visual examination
(level and quality of vibration), acoustic registration
of vowels, VCV and CVCVCV sequences and text (later
analysed with the Auditory Model Based Pitch
Extractor (AMPEX)), aerodynamic measurements
(Vital Capacity, Maximum Phonation Time, and
Maximal Intensity), and self-evaluation of i) voice
quality and ii) degree of invalidity.

Six centers participated. We retained 96 suitable files
(out of 102). Variance analysis shows a significant
relation between i) all perceptual parameters (except
Voicing), MPT and Maximal Intensity and ii) type
of surgery and/or main anatomical vibration source.
There is no correlation at all between the patient’s
perception of his/her disability and the perceptual
parameters, MPT or quality of vibration. Correlation
between the acoustical analysis and the subjective
rating was only moderate (Pearson < 0.62; standard
deviation > 1.8).


Keywords: substitution voices, multidimensional
protocol, acoustics, perception, dysphonia


I. INTRODUCTION 


In 2001 the European Laryngological Society advocated
a protocol for a multidimensional voice assessment for
laryngeal dysphonia [1]. This assessment protocol
consists of 5 dimensions: perceptual analysis, acoustic
measurements, visual evaluation (videostroboscopy),
aerodynamic measurements and self-assessment.
However, the ELS assessment protocol does not seem
applicable to substitution voicing (SV).

Substitution voicing is defined as voicing without two
true vocal folds [2] and occurs after total laryngectomy
(esophageal and tracheo-esophageal speech), partial
laryngectomy (except for horizontal supraglottic
laryngectomy), cordectomy from type III on (in which a
large part of the vocalis muscle has been removed),
severe laryngeal trauma, etc. Most of these voices are
rated as G3 on the GRBAS scale, although quality varies
widely within SV [3, 4, 5, 6, 7].
Furthermore, the acoustic signal is largely irregular and
cannot be analyzed reliably by traditional acoustic
programs (e.g. Kay Elemetrics, EVA). Therefore, we
tried to design a clinical assessment protocol for this
specific type of severe dysphonia by i) substituting
the perceptual evaluation standard (GRBAS) and the
acoustic assessment method, and ii) adapting visual
evaluation, aerodynamic measurements and
self-assessment. This manuscript describes the preliminary
results of an international trial which is still ongoing.


II. METHODS 


Perceptual evaluation scale: A new perceptual
evaluation scale, called IINFVo, was proposed and
studied for its reliability [8]. In this scale 5 parameters
are defined: overall Impression (I), impression of
Intelligibility (I), unintended additive Noise (N), Fluency
(F) and Voicing (Vo). Reliability of the scores of both
professional and semi-professional jury members was
studied on speech samples derived from native Dutch
(Ghent, Belgium) esophageal (E) and tracheoesophageal
(TE) speakers, using i) Pearson correlation, ii) Kendall’s
tau (an alternative indicator of inter-rater agreement
which does not require the scores to be normally
distributed) and iii) mean absolute deviation (MAD)
between the scores of two raters (this indicator
represents the amount of uncertainty in the score on the
0-10 VAS scale).
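
The three agreement indicators can be computed for a pair of raters as in the sketch below. It is illustrative only: the scores are invented, the function names are hypothetical, and the tau-a variant of Kendall's tau (tied pairs count as neither concordant nor discordant) is an assumption, since the text does not say which variant was used.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson product-moment correlation between two score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def kendall_tau_a(a, b):
    """Kendall's tau-a: (concordant - discordant) pairs divided by all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_abs_deviation(a, b):
    """Mean absolute difference between the scores of two raters."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical scores of two raters on the 0-10 VAS scale for five voices.
rater_1 = [2.0, 4.0, 5.0, 7.0, 9.0]
rater_2 = [3.0, 4.0, 6.0, 6.0, 9.0]

r = pearson(rater_1, rater_2)
tau = kendall_tau_a(rater_1, rater_2)
mad = mean_abs_deviation(rater_1, rater_2)
```

Pearson and Kendall measure whether two raters order the voices alike, while the MAD expresses their disagreement directly in scale units.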

Acoustic analysis: The wav files of the same speech
samples were analysed by the auditory model of Van
Immerseel and Martens [9]. This auditory model has a
built-in pitch extractor, called AMPEX (Auditory
Model-Based Pitch Extractor), which has been shown to
outperform most other pitch extractors in circumstances
with background noise. The auditory model internally
works with signal windows that are considerably longer
than 10 ms, which facilitates the extraction of evidence
for pitch values lower than 100 Hz. Every 10 ms, it
produces a 27-dimensional feature vector consisting of
23 spectral parameters, a voiced/unvoiced flag (VU = 0
or 1), a fundamental frequency (F0) or pitch (zero if
unvoiced), a voicing evidence (VE) and a frame energy
(E).

Multicenter trial with the ELS protocol adapted for SV:
The ELS protocol for laryngeal dysphonia, modified for
SV (consisting of the same 5 dimensions, but with i)
substitution of the GRBAS and the traditional acoustic
analysis, and ii) adaptation of the other three
dimensions), was recently tested by several international
centres. Six centres participated in the dimensions
perceptual evaluation, visual evaluation, aerodynamic
measurements and self-assessment rating; four centres
also participated in the voice recordings. So far,
statistics on the international data have comprised
correlation and variance analysis.


III. RESULTS


A. Perceptual evaluation scale

Inter-judge agreement, as measured on the ratings of the
102 voices recorded in Ghent, is good for
semi-professionals and excellent for professionals [8].

B. Acoustic analysis

Properly defined acoustic parameters derived from the
auditory analysis seem to demonstrate the following
(average) ordering of voices according to their overall
quality: (i) normal voicing followed by (ii) voicing with
one vocal fold, (iii) TE voicing and (iv) E voicing [2].
However, the demonstrated differences between TE
voices and E voices are rather small.

C. Multicenter trial with the ELS protocol adapted for SV

We collected 102 files (16 female, 85 male, 1
unidentified), of which 2 were not further specified
and 4 did not meet the definition of SV. The
distribution of the 96 remaining samples over the
5 main surgery types was: 11 fronto-lateral
laryngectomy/Tucker; 11 total laryngectomy with
myotomy; 15 total laryngectomy without myotomy
and/or with or without pharyngectomy or reconstruction;
22 cricohyoido(epiglotto)pexy; 37 cordectomy (from
type III on). This population differs considerably from
the population recorded in Ghent, which consisted mainly
of total laryngectomy (TL) patients.

Perception: An analysis of the correlations between 
perceptual parameters showed that the highest values are 
found between ‘General Impression’ and ‘Voicing’ 
(r=0.83) and between ‘Impression of Intelligibility’ and 
‘Fluency’ (r=0.86). 

Perceptual parameters and type of surgery/anatomical
structure: Variance analysis shows a significant effect
of the type of surgery on all perceptual parameters
(IINF) except Vo. This significance is mainly due to the
lower scores of the TL group (with or without myotomy),
except for the parameter Noise, where the
cricohyoido(epiglotto)pexy (CHEP) group scores worst.
Regarding the perceptual parameters and the
‘main vibrating anatomical structure’, vibration at the
esophageal segment scores worst. Fig. 1 gives an example
for the perceptual parameter ‘impression of intelligibility’
and the main anatomical vibratory source. This is in
agreement with the former acoustic analyses based on
the AMPEX model.

Aerodynamic measurements: Variance analysis shows a
significant relation between MPT and i) type of surgery
(p=0.03) and ii) the main anatomical structure producing
vibration (p=0.0003). For Maximal Intensity the
significance levels are p=0.0004 for type of surgery and
p=0.0034 for the main anatomical structure producing
vibration. Further analysis shows significant differences
between 1) cordectomy and i) TL (p=0.046) and ii) CHEP
(p=0.0002), 2) TL with myotomy and CHEP (p=0.009), and
3) CHEP and Tucker/FL (p=0.008).

Self-assessment: There is no correlation at all between
the patient’s perception of his/her disability and the
perceptual parameters, MPT or quality of vibration.

Acoustics: As the number of TL files was not in proportion
to the number of cordectomies, and as the AMPEX model
was initially trained on the Ghent database (which
consisted mainly of TL files), we added 19
additional files from the former Ghent database to the
international database before performing calculations.
The same 8 acoustic parameters as in the Ghent study
were extracted from the text passages. Through linear
combination of the acoustic features we designed a
regression model and applied it to 4 of the 5 subsets
(IINFVo), predicting the 5th subset. This prediction was
then compared to the subjective ratings of the clinician.
Agreement between the prediction and the clinician’s
score was only moderate
(Pearson < 0.62; standard deviation > 1.8). Fig. 2
shows the predicted scores and the
subjective scores for ‘impression of intelligibility’.


IV. DISCUSSION 


The IINFVo rating scale seems to constitute a reliable
tool for the perceptual assessment of substitution voices
and could form a viable alternative to the GRBAS scale.
In contrast to the original reliability study
conducted on the Ghent data, the analysis of the
international data demonstrates that the correlation
between the two ‘I’s is sufficiently low (r=0.7) not to
discard either of them. As the first I includes all
features and reflects a general appreciation of the voice
which is similar to the definition of Filter and Hyman,
this is in agreement with their statement that
‘Intelligibility’ and ‘Acceptability’ share only 45%
common variance, which led them to advocate including
both in a research design [10].
The highest correlations are now found between
‘Impression of Intelligibility’ and ‘Fluency’ (0.86) and
between ‘General Impression’ and ‘Voicing’ (0.83). The
latter supports the theory that SV are perceived as
qualitatively better (General Impression) when speech is
voiced and unvoiced where it is supposed to be voiced or
unvoiced. The high correlation between ‘Impression of
Intelligibility’ and ‘Fluency’ may support the theory that
Intelligibility is mainly determined by the voicing length
and fluent speech production and to a lesser extent by
voicing itself. Variance analysis suggests that perception
can differentiate surgery type and vibration source.
Further analysis reveals that this is mainly due to the
poorest scores of TL patients on the parameters IIFVo
and the poorest scores of the CH(E)P patients on the
dimension Noise.

There is only a low agreement between the perceptual 
evaluation by professionals and the self-evaluation of the 
patient’s voice (the highest is for ‘Voicing’: 0.46). 
Together with the fact that there is no correlation at all 
with the perceived disability, our data could, 
surprisingly, suggest that oncology patients mainly 
suffer from other co-morbidities (e.g. dysphagia, 
existence of a stoma) or psychological distress. 

The AMPEX acoustic analysis seems capable of
differentiating between various SV types [2].
Preliminary results in this trial, however, show only a
moderate agreement between the predicted scores and
the subjective rating. There can be various reasons for
the low concordance. First, the AMPEX model was
formerly trained mainly on laryngectomy speech. The
fact that there are far more cordectomies and partial
laryngectomies in the international trial can induce
errors. Secondly, the subjective ratings were performed
by only one clinician. Although the clinicians had
access to reference speech samples for each
acoustic feature, no real training sessions
preceded the rating. We will therefore compare the
subjective scores of the clinician with our own ratings
and additionally with those of an independent jury. If
the concordance between these ratings and the AMPEX
predictions is substantially better, we will advocate
intensive training in the IINFVo scale, possibly by
developing an audio CD in several languages. This of
course requires a large database.

Figure 1: Example of the perceptual parameter
‘impression of intelligibility’ and the main anatomical
vibratory source.

Figure 2: Concordance between the acoustically
predicted scores and the perceptual evaluation, for the
parameter ‘impression of intelligibility’.

PITCH CONTOUR FROM FORMANTS FOR ALARYNGEAL SPEECH


M. Hagmüller
Abstract: People without a larynx have to communicate
with a substitute voice, and all available options
have major shortcomings. A common problem is the
lack of a natural pitch contour. This is either because
of a constant pitch in the case of an electro-larynx or
very limited control over pitch in the case of esophageal
or tracheo-esophageal speech. To introduce a more
natural pitch contour, we propose to use the speech
formants as a source to generate an artificial pitch
contour. Earlier offline methods to introduce a natural
pitch contour to alaryngeal (AL) speech have shown that
this significantly improves the speech quality. For voice
modification, we use the voice pulse model, which
enables us to place voice pulses at arbitrary positions
even when no harmonic voice source is available, while
preserving the voice identity of the speaker. Informal
perceptual evaluations showed that using a
formant-based fundamental frequency yields a reasonable
pitch contour and is perceived as an improvement for
alaryngeal speech.

Keywords: Alaryngeal speech, pitch, formants,
enhancement


I. INTRODUCTION 


In case of laryngeal cancer at an advanced stage, the
last possibility to stop the further advancement of the
cancer, and therefore save the patient's life, is to
remove the entire larynx. This results in the loss of the
usual voice production mechanism, based on vibration of
the vocal folds. In addition, the trachea is surgically
moved to an opening at the neck, called the tracheostoma.
As a result, air no longer passes through the vocal tract.

Alaryngeal patients then have to rely on a substitute
voice production mechanism. There are three major
methods available:


Electro-larynx (EL): A hand-held device, which is held
    against the neck, produces a buzz-like sound and
    excites the vocal tract. Through usual articulation
    movements the sound source is modulated and
    voiced speech sounds can be formed. Unvoiced
    sounds are produced as in healthy speech by using
    the air reservoir available in the mouth.

Esophageal voice (ES): Air is swallowed into the
    esophagus and is released again in a controlled
    manner. The false vocal folds are then excited and
    produce the source of the speech sound. In healthy,
    laryngeal speech, the false vocal folds are not used
    for phonation, so the patient has to be trained to use
    this substitute voice source.

Tracheo-esophageal voice (TE): A valve is surgically
    inserted between the trachea and the esophagus. The
    valve lets the air from the lungs flow through
    the vocal tract if the tracheostoma is closed when
    exhaling. The air flow through the pharynx excites
    the false vocal folds and some kind of oscillation is
    produced.

All three of those voices have major shortcomings. The
electro-larynx voice sounds very mechanical, due to the
monotonous sound, which is strictly periodic at a constant
pitch. The methods which excite the false vocal folds (ES
and TE) have very unstable oscillation patterns, which
result in a very rough voice. Therefore the fundamental
frequency (F0) cannot be reliably extracted by means of
digital signal processing methods. Further, the oscillation
cannot be controlled very well, which leads to an
inconsistent pitch contour. One approach to improving the
voice would be to introduce a more natural pitch contour,
but this prosodic information has to be derived from
features other than the fundamental frequency, since this
is either constant or not measurable. The esophageal voice
also suffers from timing problems, because the amount of
air that can be swallowed limits the duration of speech
phrases considerably.

This paper will first look into related work concerning
alaryngeal voice enhancement and prosody in alaryngeal
speech. Then we will present a method to introduce a
pitch contour derived from the speech formants.


II. BACKGROUND AND RELATED WORK 
A. Alaryngeal speech 


Previous approaches introduced an artificial voicing
source as a substitute for the poor voicing source of
alaryngeal speech [2], [1]. The voicing source is based
on voicing models such as the Liljencrants-Fant (LF)
[5] model. A different approach for voicing substitution
would be to use prerecorded voice samples of sufficient
length to avoid audible loops. The patient benefits most
if extensive voice recordings can be made prior to
surgery, and even better prior to the voice degradation,
which in most cases is already severe by the time of
laryngectomy. This would be the best way
to preserve the voice identity of the speaker after the
operation [7].

In [13] the prosody of alaryngeal speech has been
investigated. Experiments have been performed to answer
the question whether alaryngeal speakers are able to
convey prosodic information without being able to produce
an F0 contour. It was shown that alaryngeal speakers do
convey prosodic information, which can be interpreted by
a listener. Further, it was investigated which cues
communicate the pitch-like information. Features
considered were high frequency intensity and spectral
tilt. A majority of the alaryngeal speakers were able to
convey accent information without using pitch cues. The
study did not show, though, which features are used for
this ‘alternative’ pitch.

Meltzer et al. [10] performed a study to find out which
type of enhancement brings the most improvement to
electrolarynx speech. They investigated combinations of
low-frequency boosting, noise reduction and natural pitch.
Listening tests were used to determine the preferred
modification method. For the natural pitch experiment the
same sentence was uttered by one person with healthy
speech, first using the vocal cords and then holding the
breath and using an EL. The natural pitch contour was then
applied to the EL speech utterance. Earlier, Ma et al. [9]
performed similar experiments. Both experiments show
that the substitution of the monotonous EL pitch with
a natural pitch contour significantly improves EL speech.
For hoarse speech, we have previously shown that a pitch
contour in the expected frequency range of a male speaker
improves the perceived quality of disordered speech [6].

Recent approaches have used energy as a feature to
provide a pitch contour for generating voicing for
whispered [11] and esophageal speech [8]. While the energy
contour may provide reasonable results for whispered
and ES speech, it does not make sense for EL speech.
Electro-larynx speech is very limited in expression, and
energy modulation is not possible or only very limited.
Commercially available ELs provide at most
two intensity settings, and the high-energy setting has
to be activated manually by pressing a button. Therefore,
other features have to be used to calculate a pitch
contour from the speech signal. Taking the formants as a
source for the pitch contour seems to be a useful approach.


B. Radiated Voiced Pulse Modeling 


A first approach to pitch modification is of course
the TD-PSOLA approach [12]. While this works for EL
speech, it cannot be used for ES and TE speech, since no
reliable pitch mark estimation can be performed. If we
want a pitch modification system to serve as a framework
for alaryngeal speech in general, not only EL speech, we
need a different approach. We have chosen the radiated
voiced pulse modeling approach by Bonada [4], which is
briefly described below.

Fig. 1. Top: Amplitude of FFT of electro-larynx speech.
Bottom: Phase of FFT.

If the input signal $y[n]$ consists of the sum of $R$
identical input signals $x[n]$ which are delayed by
multiples of $\Delta n$,

$$y[n] = x[n] + x[n-\Delta n] + x[n-2\Delta n] + \dots + x[n-(R-1)\Delta n],$$

then, after some calculation, which can be found in [4],
we find that

$$Y(e^{j\Omega}) = X(e^{j\Omega})\, e^{-j\Omega\Delta n (R-1)/2}\, \frac{\sin(0.5\,\Omega\Delta n R)}{\sin(0.5\,\Omega\Delta n)} = X(e^{j\Omega})\, \mathrm{sinc}_R(\Omega\Delta n).$$

The effect of the sinc term is that the spectrum
$X(e^{j\Omega})$ is sampled. If we assume that
$X(e^{j\Omega})$ varies only slowly, we can estimate
$X(e^{j\Omega})$ from $Y(e^{j\Omega})$ by interpolating
the harmonic peaks (see Figure 1). The full derivation,
including how the phase is dealt with, can
be found in [4].
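
The relation between the spectrum of a single pulse and the spectrum of a train of delayed copies can be checked numerically. The sketch below is not taken from [4]: it assumes the periodic-sinc factor is the closed form of the geometric sum of delay terms (phase term included), and the pulse values and function names are hypothetical.

```python
import cmath
import math


def dtft(x, omega):
    """Discrete-time Fourier transform of a finite sequence at frequency omega."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(x))


def pulse_train(x, delta_n, repeats):
    """Sum of `repeats` copies of x, each delayed by a multiple of delta_n samples."""
    y = [0.0] * (len(x) + (repeats - 1) * delta_n)
    for r in range(repeats):
        for n, v in enumerate(x):
            y[n + r * delta_n] += v
    return y


def periodic_sinc(omega, delta_n, repeats):
    """Closed form of sum_{r=0}^{R-1} exp(-j*omega*delta_n*r), phase term included."""
    half = 0.5 * omega * delta_n
    if abs(math.sin(half)) < 1e-12:
        return complex(repeats)  # at a harmonic all copies add in phase
    phase = cmath.exp(-1j * half * (repeats - 1))
    return phase * math.sin(half * repeats) / math.sin(half)


x = [1.0, 0.5, -0.25, 0.125]       # hypothetical single pulse
delta_n, R = 5, 4                  # pulse spacing in samples, number of pulses
y = pulse_train(x, delta_n, R)

omega = 0.7                        # arbitrary frequency between harmonics
lhs = dtft(y, omega)
rhs = dtft(x, omega) * periodic_sinc(omega, delta_n, R)

omega_h = 2 * math.pi / delta_n    # first harmonic of the pulse train
gain_at_harmonic = abs(dtft(y, omega_h)) / abs(dtft(x, omega_h))
```

At the harmonics the train spectrum equals R times the pulse spectrum, which is why interpolating the harmonic peaks recovers the spectral envelope of a single pulse up to a constant gain.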

So if the harmonic peaks are interpolated and
transformed back into the time domain, one can reconstruct
the voice pulses, which were filtered by the vocal tract
and radiated by the mouth (see Figure 2). The reconstructed
pulses can be placed at arbitrary positions, similar to
TD-PSOLA [12], while introducing the possibility of complex
modifications. Another advantage of the voiced pulse
model is that no pitch marks are needed for the analysis
of the signal.

The above method formed the basis of an enhancement
approach for esophageal speech [8]. Since in esophageal
speech harmonic peaks cannot be reliably determined,
the spectral envelope is determined by using a bank
of constant bandwidth filters. The phase is derived by
smoothing, shifting and scaling the magnitude envelope
of the spectrum.

The next section describes how the
enhancement system for electro-larynx speech is
implemented.

[Figure caption, truncated: ... Bottom: Windowed frame
(y axis shifted -1.5 for better visibility).]


III. DESCRIPTION OF THE ALGORITHM


While the system is intended to work in real
time, at the current stage it is implemented in MATLAB
by loading wav files, which are then processed frame by
frame. The proposed algorithm works with a sampling
frequency of 16 kHz, so the sound file is resampled if
necessary. After a high-pass filter which removes DC and
very low frequency components, the pitch of the speech
utterance is tracked with Praat. Since the pitch is usually
constant for EL speech, the processing frame length is
fixed to 3 times the pitch period. The pitch tracking can
be omitted once the fundamental frequency of the EL
is known, or in case of ES or TE speech, where pitch
determination will not work reliably.

The following steps are performed frame-wise: The
signal is transformed to the spectral domain by
calculating an FFT. The spectral envelope is derived by
interpolating the peaks, which in case of EL speech will
be the harmonics. The phase is calculated as proposed in
[8], by smoothing, scaling and offsetting the interpolated
magnitude envelope.

[Figure caption, truncated: ... Praat. Bottom: Original
electro-larynx pitch contour and ...]

Since energy modulation is very limited for EL speech,
the generation of the pitch contour relies on the formants.
The formants are tracked with the algorithm provided
by the Praat speech software [3]. At the current stage,
different methods to calculate the pitch contour from the
formants have been tried and informally evaluated. The
smoothed difference between the 1st and 2nd formant has
been chosen to generate the pitch contour:


$$f_0(t) = \mathrm{smooth}\big(F_1(t) - F_2(t)\big)/\alpha + \beta, \qquad (1)$$


where $F_1(t)$ and $F_2(t)$ are the first and second formant
and $\alpha$ and $\beta$ are constants for offsetting and
scaling. They have to be chosen to match the target average
pitch and the pitch range of the patient. Voicing is
switched on and off with a simple energy-based voice
activity detector (VAD). The voice pulse model enables us
to place the voiced pulses at arbitrary positions, i.e. at
the pitch marks derived from the pitch contour determined
using Eq. 1.
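
The mapping from formant tracks to a pitch contour and then to pitch marks might look like the sketch below. Several things here are assumptions rather than details from the paper: the smoother is a plain moving average, the difference is taken as first minus second formant (so a negative scaling constant is used to keep the contour positive), and the function names, frame length and formant values are hypothetical.

```python
def moving_average(values, width):
    """Centred moving average; the window shrinks at the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def pitch_from_formants(f1, f2, alpha, beta, width=5):
    """Per-frame f0 = smooth(F1 - F2) / alpha + beta (all values in Hz)."""
    diff = [a - b for a, b in zip(f1, f2)]
    return [d / alpha + beta for d in moving_average(diff, width)]


def pitch_marks(f0, frame_len, fs):
    """Sample positions of voice pulses, one local pitch period apart."""
    marks, pos, total = [], 0, len(f0) * frame_len
    while pos < total:
        marks.append(pos)
        pos += max(1, round(fs / f0[pos // frame_len]))
    return marks


fs = 16000                                        # sampling frequency used by the algorithm
f1 = [500.0, 600.0, 700.0, 600.0, 500.0]          # hypothetical first-formant track (Hz)
f2 = [1500.0, 1600.0, 1700.0, 1600.0, 1500.0]     # hypothetical second-formant track (Hz)

# alpha and beta are illustrative; in practice they are tuned to the patient.
f0 = pitch_from_formants(f1, f2, alpha=-10.0, beta=20.0)
marks = pitch_marks(f0, frame_len=160, fs=fs)     # 10 ms frames at 16 kHz
```

Because both formants move in parallel in this toy input, the contour is flat at 120 Hz; real formant tracks would make the difference, and hence the pitch, vary over time.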

To avoid the influence of other enhancement methods
on the evaluation of the improvement due to a pitch
contour, only the pitch was modified. Informal perceptual
evaluation was performed with five listeners using
sound samples from EL, ES and TE speech. A clear
preference was indicated for the modified speech
utterances using the proposed pitch modification. The
negative effect of unexpected pitch movements (within
a certain boundary) is outweighed by the presence of
a reasonable pitch contour.


IV. CONCLUSION 


One of the major shortcomings of electro-larynx speech
is the lack of a normal pitch contour. Previous
publications showed that adding a natural pitch contour
was the most important modification to improve the
perceptual quality of EL speech. We presented an approach
that replaces the monotonous pitch with a more natural F0
contour. While it may not necessarily be linguistically
correct at all times, it does improve the perceived
naturalness of the speech and reduces the impression of
robot-like characteristics, especially for EL speech.

Fig. 4. Block diagram of electro-larynx enhancement
system based on voiced pulse modelling.


Further work 


Further research on correct prosody has still to be
carried out and is expected to yield an improved
understanding of how ‘pitch’ accent is conveyed in
alaryngeal speech.

A further shortcoming of EL speech is that there is
no appropriate distinction between voiced and unvoiced
sounds. At the moment the whole speech utterance is
treated as voiced. A step further would be to introduce
a distinction between voiced and unvoiced sounds and to
treat them accordingly. A V/UV classifier is needed which
is able to correctly label EL speech. Then unvoiced sounds
can be left unmodified.

Further work is necessary to improve the sound quality
while preserving the identity of the voice. This includes
noise reduction to suppress the directly radiated noise,
i.e. the energy which is emitted by the EL directly
into the air and is not modulated by the vocal tract.

Newborn infant cry


ANALYSIS OF NOISE IN CRY SIGNAL USING FREQUENCY AND TIME- 
FREQUENCY TOOLS 

Abstract: An acoustic cry analysis using frequency and
time-frequency tools is developed with the objective of
obtaining precise and significant information about the
events that occur and of analyzing the effect of noise on
the signals. For hungry baby cry signals, the spectrogram,
coherence function and wavelet packet decomposition
were used on sonorant and non-sonorant segments,
and de-noising techniques were applied to remove the
unwanted components. Results show that these techniques
can be useful for a better analysis of the signal and for
eliminating the noise, but the final assessment has to
come from the physician’s point of view.

Keywords: Cry analysis, Spectrogram, Coherence
Function, De-noising, Wavelet Packets


I. INTRODUCTION 


Crying can be seen as a complex and dynamic 
biological phenomenon. It involves vocalization features, 
facial expressions and limb movements, all of which may 
vary over time. As a signal, crying can be considered as a 
set of functions that characterizes the acoustic events that 
it comprises and the resulting analysis happens to be quite 
complex. Some components appear as a consecutive set 
of pure tones in a certain type of arrangement while 
others are a mix of noise components and tones, all of 
them happening in different periods of time. 

Cry research has looked for possible
relationships between the cry and the condition of the
subject, mainly babies and infants.
For example, some studies address the effect of age on the
measurement of pain in babies by the acoustic analysis of
the signal [a], while others focus on using the cry
as an additional means of assessing the degree of brain
damage in children with malnutrition [b]. The classic tool
for cry analysis has been the sound spectrogram,
but in recent years sophisticated frequency and
time-frequency techniques have proved effective for
the acoustic analysis of signals such as speech and music.

This paper presents acoustic cry analysis using
frequency and time-frequency analysis tools in order to
obtain precise and significant information about the
events and to analyze the effect of noise on the signals.
The coherence function and the wavelet packet
time-frequency representation are considered and
compared with the spectrogram.


II. METHODS


In this section data aspects and frequency and time- 
frequency methods are presented. The complete analysis 
was developed in Matlab. 


A. Data 


Crying signals are difficult to record; the environmental
conditions do not allow control of the noise that arises
from people’s voices (especially the mother’s), from
incidental sounds of the surroundings and even from the
lack of cooperation of the baby. It should be noted that
the recording conditions are uncomfortable for
him/her. Long recording sessions (up to a minute) are
suitable for the analysis, but they are not always
possible to achieve, and the classification of the cry is
not clear when a possible illness is considered. Baby cry
signals taken from [c] were used for the analysis.


B. Spectrogram and Coherence Function 


Spectral signal analysis focuses on the relationships 
among frequency components. Historically the sound 
spectrogram has been the major tool for analyzing the 
acoustics of cry [d]. The spectrogram algorithm usually 
splits the signal into overlapping segments by applying a 
fixed-length window (N). For each segment the discrete- 
time Fourier transform is calculated, in order to produce 
an estimate of the short-term frequency content, or 
spectra. The algorithm is repeated iteratively and the 
spectra are collected to form the spectrogram which is 
computed from 


$$\Gamma_y(\omega) = 2\pi \sum_{k} |c_k|^2 \, \delta\left(\omega - \frac{2\pi k}{N}\right) \qquad (1)$$


where $\Gamma_y(\omega)$ is the power density spectrum for a
periodic signal y(n), while $c_k$ are the associated
coefficients [e]. The set of parameters used is the
following:

- sampling frequency = 22050 Hz

- nfft=256 

- hamming window of nfft length 

- no overlap between windows. 
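
The parameter set above can be tried out with a minimal sketch, written here in plain Python rather than the Matlab used by the authors. It assumes a direct DFT, a symmetric Hamming window and a synthetic test tone; the function names and the test signal are illustrative and not taken from the paper.

```python
import cmath
import math

FS = 22050      # sampling frequency in Hz
NFFT = 256      # window length and DFT size

def hamming(n_points):
    # Symmetric Hamming window of n_points samples.
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n_points - 1))
            for i in range(n_points)]

def power_spectrum(frame):
    # Squared magnitude of the DFT for the non-negative frequency bins.
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = 0j
        for i, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * i / n)
        spectrum.append(abs(acc) ** 2)
    return spectrum

def spectrogram(signal, nfft=NFFT):
    # Consecutive windows with no overlap: the hop equals the window length.
    window = hamming(nfft)
    frames = []
    for start in range(0, len(signal) - nfft + 1, nfft):
        segment = [signal[start + i] * window[i] for i in range(nfft)]
        frames.append(power_spectrum(segment))
    return frames

# Synthetic test tone placed exactly on DFT bin 20 (an assumption for the demo).
tone_hz = 20 * FS / NFFT
signal = [math.sin(2 * math.pi * tone_hz * n / FS) for n in range(4 * NFFT)]
spec = spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

Each row of `spec` is one short-term spectrum with `NFFT // 2 + 1` bins; bin `k` corresponds to the frequency `k * FS / NFFT`.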

The coherence function (CF) is a measurement of the
linear dependence between two signals as a function of
frequency. It is defined in terms of power spectral
densities and the cross spectral density by


$$C_{xy}(\omega) = \frac{|R_{xy}(\omega)|^2}{R_{xx}(\omega)\,R_{yy}(\omega)} \qquad (2)$$


where $R_{xx}(\omega)$ and $R_{yy}(\omega)$ are the power spectral
densities (the Fourier transforms of the autocorrelation
functions) of processes X and Y, and $R_{xy}(\omega)$ is the
cross spectral density (the transform of the cross
correlation function) between both processes [f]. It
evaluates how correlated X and Y are at each frequency; a
high coherence value points to a strong presence of that
frequency in both signals. The CF was obtained from a set
of 256 coefficients covering the frequency bandwidth, using
a Hamming window with no overlap.
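
As a rough illustration of how such a coherence estimate behaves, the sketch below averages windowed spectra over consecutive non-overlapping segments and forms the magnitude-squared coherence, |Sxy|^2 / (Sxx Syy). This magnitude-squared form, the short window (32 points instead of 256, to keep the pure-Python DFT fast) and the random test signals are assumptions for the example, not details taken from the paper.

```python
import cmath
import math
import random

def windowed_dft(segment):
    # DFT of a Hamming-windowed segment, non-negative frequency bins only.
    n = len(segment)
    w = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    return [
        sum(segment[i] * w[i] * cmath.exp(-2j * math.pi * k * i / n)
            for i in range(n))
        for k in range(n // 2 + 1)
    ]

def coherence(x, y, nfft=32):
    # Average auto- and cross-spectra over non-overlapping segments.
    n_bins = nfft // 2 + 1
    sxx = [0.0] * n_bins
    syy = [0.0] * n_bins
    sxy = [0j] * n_bins
    for start in range(0, min(len(x), len(y)) - nfft + 1, nfft):
        fx = windowed_dft(x[start:start + nfft])
        fy = windowed_dft(y[start:start + nfft])
        for k in range(n_bins):
            sxx[k] += abs(fx[k]) ** 2
            syy[k] += abs(fy[k]) ** 2
            sxy[k] += fx[k] * fy[k].conjugate()
    return [abs(sxy[k]) ** 2 / (sxx[k] * syy[k]) for k in range(n_bins)]

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(8 * 32)]
independent = [rng.gauss(0.0, 1.0) for _ in range(8 * 32)]
scaled = [2.0 * v for v in x]

c_linear = coherence(x, scaled)        # linearly related signals
c_noise = coherence(x, independent)    # unrelated signals
```

Averaging over several segments matters: with a single segment the estimate equals 1 at every frequency regardless of the signals. Linearly related signals give values close to 1, while unrelated noise gives low values, which is the behaviour the low CF values reported for the cry recordings are interpreted against.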


C. Time-Frequency decomposition and de-noise 


Time-frequency representations (TFR) make it possible to
see the behavior of a signal in both domains at the same
time. Although the spectrogram is a kind of TFR, it is
based on sine and cosine functions and has important
constraints in terms of resolving sudden changes or other
events. Wavelets are special functions that can provide
the TF resolution needed. The wavelet functions can be
converted into packets (WP) by setting their time and
frequency values, and a minimal representation of the data
can be obtained by calculating the "best basis", a set of
WP selected by applying a particular cost function. The
"best basis" is used in applications that include noise
reduction and data compression.

The model for the noisy signal is basically of the
following form:

$$s(n) = f(n) + \sigma e(n)$$

where the time index n is equally spaced. The basic model
assumes that e(n) is Gaussian white noise N(0,1) and that
the noise level $\sigma$ is equal to 1. The de-noising
objective is to suppress the noise part of the signal s
and to recover f. De-noising can be accomplished via
wavelet-based shrinkage methods. These techniques use
wavelets to transform data into a different basis [g];
large coefficients correspond to the signal, and small ones
represent mostly noise. The de-noised data is obtained by
inverse-transforming the suitably thresholded, or shrunk,
coefficients. The threshold can be hard or soft. Hard
thresholding sets to zero the elements whose absolute
values are lower than the threshold. Soft thresholding
first sets to zero the elements whose absolute values are
lower than the threshold, and then shrinks the remaining
nonzero coefficients towards 0. An eighth-order Daubechies
wavelet was used for a five-level WP decomposition with
Shannon entropy as the cost function. De-noising was
performed using a sparse norm with a soft threshold level
of 0.4345.
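
The two thresholding rules can be written down compactly. The sketch below only applies the rules to a list of coefficients; the wavelet packet transform and its inverse are not reproduced, and the sample coefficient values are invented for the example.

```python
import math

def hard_threshold(coeffs, thr):
    # Zero every coefficient whose absolute value is below the threshold.
    return [c if abs(c) >= thr else 0.0 for c in coeffs]

def soft_threshold(coeffs, thr):
    # Zero small coefficients, then shrink the survivors towards 0 by thr.
    shrunk = []
    for c in coeffs:
        magnitude = abs(c) - thr
        shrunk.append(math.copysign(magnitude, c) if magnitude > 0 else 0.0)
    return shrunk

THRESHOLD = 0.4345                     # soft threshold level quoted in the text
example_coeffs = [1.0, -0.2, -0.9345]  # illustrative values only
hard_result = hard_threshold(example_coeffs, THRESHOLD)
soft_result = soft_threshold(example_coeffs, THRESHOLD)
```

Hard thresholding leaves the surviving coefficients untouched, whereas soft thresholding reduces their magnitude by the threshold, which avoids a jump at the threshold at the cost of some attenuation.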


III. RESULTS


The coherence function (CF) was applied to several cry
signals in order to find highly related frequencies, but
in most cases very low values were found over the complete
frequency domain. Signals were segmented by hand, and
sonorant and non sonorant segments were isolated and
analyzed with the CF. Tables 1 and 2 show the resulting
values for a hungry baby cry signal. It can be seen that
sonorant segments show more correlation than non sonorant
ones, although they still have low values; this behavior
was repeated in all the crying classes. Several reasons
can account for low CF values. One possibility is the
presence of uncorrelated noise in the signals, due to the
conditions that make a baby's crying difficult to record:
environmental components such as the sound of the mother's
voice, the baby's movements or the recording procedure
affect the quality of the signal obtained.

In order to assess any possible noise effect, the
spectrogram of a non sonorant segment and its WP
decomposition were obtained. This segment was de-noised
using a Daubechies wavelet and the recovered signal was
obtained. To compare the difference, spectrograms of both
the original and de-noised signals were also obtained.
Fig. 1 shows the spectrograms of the hungry baby cry
signal, of a non sonorant segment and of its de-noised
version, as well as their time versions. It can be seen
that the most predominant components are kept in the
de-noised signal, but the decomposition also shows high
frequency components that are absent from the original
signal. In order to visualize the eliminated components,
which in a certain way can be considered as the embedded
noise, the spectrograms of the de-noised and the residual
segments are shown in Fig. 2. High energy components are
present in both signals, and high frequency noise appears
in the residual.


Table 1. Maximal coherence function value of sonorant
segments in a hungry baby cry signal


Sonorant segments    Frequency    Coherence function value
1st - 2nd            1895         0.3239
1st - 3rd            1895         0.4644
1st - 4th            8355         0.1238
1st - 5th            861          0.1248
1st - 6th            8355         0.1938




Table 2. Maximal coherence function value of non
sonorant segments in a hungry baby cry signal


Non sonorant segments    Frequency    Coherence function value
1st - 2nd                2584         0.1872
1st - 3rd                775          0.2223
1st - 4th                9474         0.1121
1st - 5th                0            0.3237

Fig.1 De-noising analysis of a hungry baby cry signal: 
a) spectrogram of the complete signal, b) original and de- 
noised non sonorant segment, c) non sonorant segment 
spectrogram, d) de-noised non sonorant segment 
spectrogram 


Fig.2 Spectrograms of a) de-noised and b) residual of 
a non sonorant segment from a hungry baby cry signal 


IV. DISCUSSION


Assessing the presence of noise in a signal that contains
high frequency components is not an easy task. It is
common to find some noisy components assigned to the
crying signal, and in the spectrogram it is not clear
where the boundary between the two lies. The use of a
de-noising technique would make it possible to identify
those noisy components. A noise elimination method was
applied to a segment of a crying signal, and the resulting
de-noised version presented some high frequency components
that were apparently absent from the original signal. This
can be seen as a confirmation of the presence of noise
that covered those components, but more experimentation
is needed to be certain.

The selection of the best basis function as well as of the
de-noising settings is a problem to contend with in this
application. In the experiments an eighth-order Daubechies
wavelet was used because it has adequate properties, such
as compact support and the capability of exact
reconstruction; some other wavelets, such as symmlets or
coiflets, have the same features, and their shapes make
them interesting for the analysis of a non sonorant
component. Shannon entropy was applied to obtain the WP
decomposition and a sparse thresholding method defined the
de-noising procedure; other WP de-noising configurations
are possible. Although time-frequency analysis brings
important improvements in the visualization and processing
of the signals, the spectrogram representation is still a
useful tool for detecting trends in the components and, as
in this case, for comparing the de-noised signals with the
original ones.


V. CONCLUSION 


Frequency and time-frequency analyses were carried out
on non sonorant segments of crying signals. The
spectrogram and the coherence function showed the presence
of noise components, and a de-noising method based on
wavelet packet decomposition was applied. Results show
that it can be useful to use both techniques to perform a
better analysis of the signal. Wavelet decomposition and
de-noising techniques have proved to be effective in
eliminating unwanted components, but the final assessment
has to come from the physician's point of view. In order
to achieve a suitable generalization of the techniques,
the data set must be enlarged and reference signals have
to be considered.


Abstract: Crying is a physiological action made by
the infant to communicate and to draw attention. 
However, especially for a premature infant, this 
action requires great effort, which may even have an 
adverse impact on blood oxygenation. In this work 
we present first results concerning the evaluation of 
the distress occurring during cry, as related to 
possible decrease of cerebral oxygenation. A 
recording system has been developed, that allows 
synchronised monitoring of the central blood 
oxygenation and the audio recording of newborn 
infant’s cry. 

A multi-purpose voice analysis tool (BioVoice),
characterised by high resolution and tracking
capabilities, is applied to newborn infant cries. For
these signals, the tool also provides detailed statistics
(min and max cry length, maximum energy, etc.) to
help diagnosis. BioVoice is completely automatic,
works with any sampling frequency and F0, and does not
require any manual setting by the user, which makes it
easily accessible also to non-experts.

Some examples are reported, concerning preterm 
new-born infants. 

Keywords: newborn cry, blood oxygenation, voice 
analysis. 


I. INTRODUCTION 


Infant monitoring in neonatal critical care units is a
common procedure in clinical practice. The cerebral
blood flow in preterm and full-term newborn infants has
been studied extensively, as newborn infants have an
impaired autoregulation of the cerebral blood flow.
Irregularities in the blood flow and pressure may
adversely influence the growth of the child. Some
studies have been performed in order to evaluate the
blood flow and oxygenation in the newborn by Near
Infrared Spectroscopy (NIRS), also in combination with
other techniques [1]-[6].

In newborn infants, one of the most common events that 
may affect the respiratory flow is related to cry. Crying 
is a physiological action made by the infant to 
communicate and to draw attention. It involves 
coordinated actions of many muscles of abdomen, chest, 
throat and head. This apparatus is obviously controlled 
by the central nervous system (CNS). 

Specifically, for a premature infant, crying requires
great effort, which may cause distress. Also, preterm
and/or low-birth-weight infants often present respiratory
problems, ranging from insufficient ventilation to
apnoea, and hence crying implies an effort which may
have an adverse impact on blood oxygenation. Acoustic
analysis of newborn infant cry signals is thus of
importance, as an early aid to clinical evaluation of
several CNS pathologies. Being easy to perform, cheap
and completely non-invasive, it can be successfully
applied in many circumstances [8]-[14]. A robust
high-resolution software tool is proposed here, to track
the main acoustic parameters of newborn cry.

Possible relationship among some cry parameters and 
distress is investigated, as related to the decrease of 
cerebral oxygenation. To this aim, a new recording 
system has been developed, that allows synchronised 
monitoring of the central blood oxygenation and the 
audio recording of infant’s cry emissions. 

Preliminary results on a data set of 9 preterm infants
indicate that in some cases the effort in crying is
associated with a noticeable decrease in the oxygenation
level during a cry episode.


II. METHODS 


Central blood saturation was measured with a NIRS
device (somasensors by INVOS 5100C Somanetics
Corp.), which acquires 1 sample every 5 s. A
unidirectional microphone (Shure SM58), equipped
with a US-144 portable audio/MIDI interface (96 kHz /
24-bit recording), was used to record cry emissions.
Audio recording was performed using a multimedia
notebook which acquired a single-channel audio track,
with a sampling rate of 44 kHz and 16-bit resolution.
Specific software has been designed and implemented
to allow synchronization with the NIRS device, using a
digital output linking the laptop with the input of the
NIRS instrument. The software performs a simultaneous
recording of the audio channel through the US-144 board
and of the NIRS signal using an RS-232 connection. The
NIRS signal is composed of up to four independent
channels, each made up of two data items, one containing
the relative saturation of oxygen, and the other
representing the quality of the signal, which can be
useful to detect possible artifacts related to patient
movement or poor contact of the sensor with the patient.

Due to different sampling rates for NIRS and for audio 
signals, the range for audio analysis is adjusted to the 
nearest second in the corresponding NIRS recording. 

As for audio signal analysis, a multi-purpose voice
analysis tool (BioVoice) allows for newborn infant cry
analysis, performing F0, noise and resonance frequency
tracking on signal frames of varying length (even a few
ms), adaptively tailored to varying signal
characteristics. Details are given below.

Fundamental frequency F0 - Newborn infant cry is
characterised by high fundamental frequency F0
(>300 Hz), with abrupt changes and voiced/unvoiced
features of very short duration within a single utterance.
For analysis, the signal is divided into short frames,
whose length adaptively varies according to varying
signal characteristics: the higher the F0, the shorter the
frame length (kept fixed to 3 pitch periods). A
voiced/unvoiced (V/UV) separation algorithm is
implemented, to avoid F0 estimation on signal frames
that have no harmonic content, where misleading results
could be obtained [7].
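
The rule of a frame spanning 3 pitch periods translates directly into a frame length in samples. The sketch below is only an illustration of that arithmetic; the function name is hypothetical and the 44100 Hz sampling rate is an assumption (the text quotes 44 kHz).

```python
def frame_length_samples(f0_hz, fs_hz, periods=3):
    # One pitch period lasts fs/F0 samples; the frame spans `periods` of them.
    return max(1, round(periods * fs_hz / f0_hz))

FS = 44100  # assumed sampling rate in Hz
frame_at_300 = frame_length_samples(300.0, FS)  # lower bound of typical cry F0
frame_at_450 = frame_length_samples(450.0, FS)  # a higher-pitched cry
frame_ms_at_300 = 1000.0 * frame_at_300 / FS    # duration in milliseconds
```

At F0 = 300 Hz the frame lasts about 10 ms, and it shortens as F0 rises, consistent with frames of only a few milliseconds.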

F0 tracking is achieved by means of a two-step
procedure, based on well-established results: the AMDF
(Average Magnitude Difference Function) approach is
applied to a wavelet-smoothed SIFT (Simple Inverse
Filter Tracking) estimation of F0, with optimised and
varying adaptive filter order [8]-[10].
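
To make the AMDF idea concrete, the following sketch estimates F0 from a single frame by picking the lag that minimises the average magnitude difference. It covers only a plain AMDF search; the SIFT pre-estimate, the wavelet smoothing and the adaptive filter order used in BioVoice are not reproduced, and the search range and test tone are assumptions chosen for the example.

```python
import math

def amdf(frame, lag):
    # Average magnitude difference between the frame and its lagged copy.
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n

def estimate_f0_amdf(frame, fs_hz, f0_min_hz, f0_max_hz):
    # Search the lags matching the allowed F0 range; the deepest AMDF
    # valley marks the pitch period.
    min_lag = int(fs_hz / f0_max_hz)
    max_lag = int(fs_hz / f0_min_hz)
    best_lag = min(range(min_lag, max_lag + 1),
                   key=lambda lag: amdf(frame, lag))
    return fs_hz / best_lag

FS = 44100  # assumed sampling rate in Hz
frame = [math.sin(2 * math.pi * 441.0 * n / FS) for n in range(600)]
f0 = estimate_f0_amdf(frame, FS, 300.0, 1300.0)
```

The search range is kept narrow here so that only one pitch period fits inside it; over a wide range, the valleys at multiples of the period make a plain AMDF prone to octave errors, which is one reason a prior estimate is useful.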

Resonance frequencies Fi - Even if vowel frequencies
cannot be found in newborn cry, RFs reflect important
acoustical characteristics of the vocal tract of the infant.
Robust and high-resolution RF estimation is
implemented, based on parametric AutoRegressive
(AR) PSD evaluation. The AR model order p is
automatically selected by the program, according to the
relationship $p = 2LF_s/c$, where $F_s$ = sampling frequency,
L = vocal tract length, and c = sound speed [9].
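
A short worked example of the order rule p = 2·L·Fs/c follows. The numerical values are assumptions for illustration only: a sound speed of 340 m/s, a sampling rate of 44100 Hz, a vocal tract of about 0.17 m for an adult and a hypothetical 0.08 m for a newborn; the paper does not state the values it uses.

```python
def ar_model_order(fs_hz, tract_length_m, sound_speed_m_s=340.0):
    # p = 2 * L * Fs / c, rounded to the nearest integer order.
    return round(2.0 * tract_length_m * fs_hz / sound_speed_m_s)

FS = 44100                                  # assumed sampling rate in Hz
order_adult = ar_model_order(FS, 0.17)      # assumed adult tract length
order_newborn = ar_model_order(FS, 0.08)    # hypothetical newborn tract length
```

A shorter vocal tract gives a lower model order at the same sampling rate, since fewer resonances fall below the Nyquist frequency.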

The BioVoice tool is provided with a user-friendly
interface (Fig. 1) that allows selecting age, sex and type
of vocal emission for each patient, performing
computations without any other requirement. The tool
automatically adjusts internal settings for optimal frame
length, frequency range of analysis and plots.
Specifically, the interface allows for:
- selecting data (.wav files);
- choosing the voice type, ranging from high-pitched
  newborn and singers' voices to adult voices: the
  overall allowed F0 range is 40 Hz < F0 < 1300 Hz;
- selecting the kind of analysis: single audio file or two
  files (for comparison purposes).

A notice is added concerning the computing time
required: for long files (>5 s) and high sampling
frequency (>40 kHz) the total time could approach 5 min.
A moving bar shows the remaining time during
computations.

A number of ad hoc plots and tables are displayed and
saved in printable format, for a visual comparison of
results. Specifically, for infant cry, F0, V/UV frames,
spectrogram and resonance frequencies are plotted, all
as colour maps. Some tables summarise mean, std, max and
min values for F0 and F1-F3, as well as cry length and
the corresponding maximum energy. These parameters
are in fact considered among the most meaningful in
newborn cry analysis [8]-[12].

Figure 1 - The user interface for acoustic analysis


III. RESULTS


Infants were selected by physicians among patients at
the Critical Care Unit of the Children Hospital A. Meyer,
in Firenze, Italy. The analysis has been carried out on a
group of 9 preterm infants, with a gestational age
ranging from 23 to 38 weeks and a birth weight
between 590 g and 3020 g. No relevant pathology was
found among the analysed infants. An example is
reported, concerning a newborn infant with a gestational
age of 38 weeks and a birth weight of 3020 g. The
birth was spontaneous. A suspected congenital
cardiopathy was pointed out by clinicians.

From our previous studies, the effect of crying seems 
much larger on central blood saturation than on 
peripheral saturation. Moreover, tracking the saturation 
level pointed out an increase of saturation after the 
episode, which means that the nervous system tries to 
compensate the loss of oxygen due to crying [14]. 

The example reported, though relative to an almost
full-term newborn, points out possible distress due
to crying, as evidenced by some voice parameters
(mainly cry length, F0 melody and RFs), corresponding
to a drop in oxygenation levels.

For printing reasons, we report here only a subset of the 
available figures, in a grey scale. 

Figure 2 shows the plot of the NIRS values (% as
referred to saturation) for about 27 min of recording,
extracted from a longer period. Actually, the new tool
allows for simultaneous recording of both NIRS and cry
over a range of several hours.

As shown in the figure, a remarkable decrease of RO2
occurs around the time instants 0:08:35, 0:15:35 and in
the interval 0:21:15-0:24:15, all corresponding to cry
episodes, automatically marked by the software.
Specifically, the interval 0:23:01-0:23:04 is considered
here, and indicated by the arrow in Fig. 2. Fig. 3 shows
the V/UV parts of the cry episode, as found by the
BioVoice tool. A UV segment was found in the range
(1.58 s-2.3 s).




Table 1 reports the information about V/UV segments 
of the cry that could be of relevance for diagnosis. 


Figure 2 - NIRS tracking with a marker for the cry
episode in the interval 0:23:01-0:23:04 (horizontal axis:
time)


Figure 3 - V/UV parts of the cry signal (plot title:
"Voiced and unvoiced parts of the signal"; axes:
normalized peak values versus time [s])


TABLE 1 - V/UV characteristics

*VOICED PARTS

Start      End        Total
0.020 s    1.820 s    1.800 s
2.340 s    2.620 s    0.280 s

Max duration = 1.800 s; Min duration = 0.280 s; Mean duration = 1.040 s
Total duration = 2.060 s; Number of parts = 2




Fig. 4 shows F0 tracking, performed on the voiced parts
of the signal only. F0 is characterized by an almost
regular rising and falling shape, typical of the newborn
infant cry melody. However, notice the shorter duration
of each utterance (<1 s) and the lower mean F0 value, as
compared to healthy cry [9]-[14].

In Fig. 5, the spectrogram is displayed with the tracking
of the first three RFs superimposed. Notice the rather
irregular shape of the RFs, the 3rd one being almost
unrecoverable. Moreover, RFs are set to lower
frequencies with respect to the healthy cry [9]-[14], as
shown in Table 2, where the maximum energy of the
signal is also reported. This could be due to the still
incomplete vocal tract structure in the newborn, as well
as to his/her possible CNS dysfunction.

The analysis also suggests that physiological
compensation systems are not able to maintain the level
of blood oxygenation during crying episodes.


Figure 4 - Fundamental frequency tracking


Figure 5 - Spectrogram and resonance frequencies F1-F3


TABLE 2 - Summary of main statistics of F0 and of
RFs F1-F3, along with maximum power.


*FUNDAMENTAL FREQUENCY

Mean F0 = 353.4 Hz; Std = 117.0
Max F0 = 918.5 Hz; Min F0 = 148.4 Hz

*RESONANCE FREQUENCIES

Mean F1 = 790.2 Hz; Std F1 = 460.5
Mean F2 = 2784.7 Hz; Std F2 = 658.5
Mean F3 = 4442.8 Hz; Std F3 = 2079.1

*POWER MAX = -3.078 dB


IV. FINAL REMARKS


First results have been presented concerning the
evaluation of the distress occurring during cry, as
related to a possible decrease of cerebral oxygenation.
To this aim, the relationship between some cry parameters
and the decrease of cerebral oxygenation is investigated.
A synchronisation system has been developed that
allows simultaneous acquisition of the central blood
oxygenation and the audio recording of the infant's cry
emissions. A new robust tool for newborn infant cry
analysis is presented. Being completely automatic, the
proposed software can be successfully used in a wide
range of applications, also in the case of highly varying
signals, without requiring any manual setting by the
user.

Preliminary results on a data set of 9 preterm infants
indicate that in some cases the effort in crying is
associated with a noticeable decrease in the oxygenation
level during a cry episode and with abnormal cry
parameters.

Future work will concern adding more parameters for
audio signal analysis, as well as further optimising
existing ones. A database is under construction, in
co-operation with the Children Hospital A. Meyer, Firenze,
Italy, with the aim of searching for possible correlations
also among other signals, such as ECG and peripheral
blood oxygenation, as a non-invasive aid to diagnosis.


Abstract: The paper addresses the implementation of a
cry-based classifier for neonatal diagnosis. The main
contribution concerns the articulated processing of cry
signals, which includes two kinds of approaches: a
threshold-based classification and an ANN-based
classification. Each approach makes its own contribution
to the cry classification, and both are combined in a
two-class classifier (pathological and normal).
Moreover, the use of the cry unit as primary data is
also an interesting aspect of this work. This
articulated cry processing is the main body of a new
cry-based methodology for neonatal diagnosis, which
will be presented in a few months by the Group of
Speech Processing in Cuba.


Keywords: cry analysis, neural networks 


I. INTRODUCTION 


Since new approaches such as ANNs have been applied to
cry classification, the possibility of making a cry-based
diagnosis in newborns has become a reality [1-3] [17]. In
this paper the state of the art in cry analysis and new
soft computing approaches have been combined, leading to
a suitable articulated processing of cry signals oriented
towards neonatal diagnosis. As explained in the main body
of the paper, five specific forms of processing are
articulated in one: (1) digital signal processing
(acoustic cry parameter extraction and Mel frequency
cepstral coefficient (MFCC) estimation), (2) data
management (BDLlanto: a Cuban corpus of cry signals),
(3) principal component analysis (PCA), (4) neural
network-based classification and (5) a threshold-based
decision.


II. METHODS 


The research work was based on the physioacoustic model
for cry production and Golub's muscle control model. As
mentioned above, two classification approaches are
articulated in one:


(1) Threshold-based classifier: the threshold values of
four cry features for normality are considered [5-9] [16]:


Voicedness: the ratio of the amount of periodic sound
to the amount of noise (the higher the voicedness, the
weaker the noise component in comparison to the
periodic sound).

Melody: the behaviour of the fundamental frequency over
time, within one cry unit.

Stridor: a rapid increase in air pressure causes the vocal
cords to enter a turbulent state, resulting in a sudden
loss of pitch.

Shift: a sudden large change in pitch.


The procedures for computing those attributes are the
same as those suggested by Cano et al. [17] in 2006.


Cry Corpus.

The cry samples were taken from a Cuban cry corpus
named BDLlanto (32 cases: 16 healthy children and 16
pathological children). The database includes a
user-friendly interface, which lets the user manage
acoustical and clinical information of newborns in an
efficient manner. It also incorporates some features of
Web technologies for Internet facilities.


(2) ANN-based classifier: it consists of a feed-forward
network using the Method of Scaled Conjugate Gradient
(MSCG) as the learning algorithm. The input vector is
composed of the Mel frequency cepstral coefficients
(MFCCs) [4] [11-13].


Mel Frequency Cepstral Coefficients. 

The low-order cepstral coefficients are sensitive to the
overall spectral slope and the high-order cepstral
coefficients are susceptible to noise. This property of the
speech spectrum is captured by the Mel spectrum. Higher
frequencies are weighted on a logarithmic scale
whereas lower frequencies are weighted on a linear
scale. The Mel scale filter bank is a series of L triangular
band-pass filters designed to simulate the band-pass
filtering believed to occur in the auditory system. This
corresponds to a series of band-pass filters with constant
bandwidth and spacing on a Mel frequency scale. On a
linear frequency scale, this spacing is approximately
linear up to 1 kHz and logarithmic at higher frequencies
(see Fig. 1) [11].
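
The spacing behaviour described above can be checked numerically. The sketch below uses a widely used analytic form of the Mel scale, mel = 2595·log10(1 + f/700); the paper does not state which formula it relies on, so this choice, the number of filters and the upper frequency are assumptions.

```python
import math

def hz_to_mel(f_hz):
    # Common analytic approximation of the Mel scale (an assumption here).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_max_hz):
    # Centers equally spaced on the Mel axis, mapped back to Hz.
    top = hz_to_mel(f_max_hz)
    return [mel_to_hz(top * i / (n_filters + 1))
            for i in range(1, n_filters + 1)]

centers_hz = mel_center_frequencies(20, 8000.0)
spacings_hz = [b - a for a, b in zip(centers_hz, centers_hz[1:])]
```

Equal steps on the Mel axis give center frequencies whose spacing in Hz grows steadily: close together at low frequencies and increasingly far apart higher up.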


Many speech recognition systems are based on the
MFCC approach and its first- and second-order
derivatives. The derivatives are normally approximated by
fitting a linear regression line over an adjustable-size
segment of consecutive frames. The time resolution and
the smoothness of the estimated derivative depend on the
size of the segment.
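
A regression-based derivative of this kind is often written as d_t = Σ n·(c_{t+n} − c_{t−n}) / (2·Σ n²), with n running from 1 to the half-width of the segment. The sketch below implements that common formula for a single coefficient track; the edge handling (repeating the first and last frames) and the half-width are assumptions, since the paper gives no details.

```python
def delta(coeffs, half_width=2):
    # Regression slope over a window of 2 * half_width + 1 frames.
    n_frames = len(coeffs)
    denom = 2.0 * sum(n * n for n in range(1, half_width + 1))
    deltas = []
    for t in range(n_frames):
        acc = 0.0
        for n in range(1, half_width + 1):
            ahead = coeffs[min(t + n, n_frames - 1)]   # repeat last frame at the edge
            behind = coeffs[max(t - n, 0)]             # repeat first frame at the edge
            acc += n * (ahead - behind)
        deltas.append(acc / denom)
    return deltas

ramp = [2.0 * t for t in range(7)]   # one coefficient rising by 2 per frame
ramp_delta = delta(ramp)
```

A larger `half_width` gives a smoother derivative at the cost of time resolution, which is the trade-off mentioned in the text. Applying `delta` to its own output gives the second-order derivative.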

Fig. 1 The Mel Filter Bank


The computation of MFCCs follows these steps:
• The signal is divided into short segments
• The discrete Fourier transform is computed
• The spectrum is converted to a logarithmic scale
• The scale is transformed into a smoothed Mel spectrum
• The discrete cosine transform (DCT) is computed

The above-mentioned algorithm is illustrated in Fig. 2.

Fig. 2 MFCC computation from a cry signal

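The MFCC steps listed above can be sketched for a single frame as follows. This is a minimal illustration, not the authors' implementation: it applies the Mel filter bank before taking the logarithm (the usual order in practice, which differs from the order of the list), uses a direct DFT, and all sizes (256-point frame, 12 filters, 8 coefficients, 22050 Hz) are assumptions.

```python
import cmath
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    # Direct DFT, non-negative frequency bins only.
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))) ** 2 / n
        for k in range(n // 2 + 1)
    ]

def mel_filterbank(n_filters, nfft, fs_hz):
    # Triangular filters with edges equally spaced on the Mel axis.
    top = hz_to_mel(fs_hz / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) * nfft / fs_hz
             for i in range(n_filters + 2)]       # fractional bin positions
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(nfft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def dct2(values, n_out):
    # Unnormalised DCT-II, keeping the first n_out coefficients.
    n = len(values)
    return [sum(v * math.cos(math.pi * q * (i + 0.5) / n)
                for i, v in enumerate(values))
            for q in range(n_out)]

def mfcc_frame(frame, fs_hz, n_filters=12, n_ceps=8):
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = power_spectrum(windowed)
    bank = mel_filterbank(n_filters, n, fs_hz)
    energies = [sum(w * p for w, p in zip(row, spectrum)) for row in bank]
    log_energies = [math.log(e + 1e-12) for e in energies]  # guard against log(0)
    return dct2(log_energies, n_ceps)

FS = 22050
test_frame = [math.sin(2 * math.pi * 1000.0 * n / FS) for n in range(256)]
coeffs = mfcc_frame(test_frame, FS)
```

For a whole cry unit the same function would be applied frame by frame, giving one coefficient vector per frame.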
The Artificial Neural Network (ANN) 


ANNs have had a great impact on the development of
several research areas such as computer vision,
autonomous vehicles, pattern recognition,
connected-speech synthesis and, more recently, the
classification of cry units [1-2] [4]. The ANN structure
used in this paper is shown in Fig. 3. It corresponds to a
feed-forward network in which x1, x2, x3, ..., xn
represent the acoustic features of the signals and y1, y2,
y3, ..., ym the m classes to be identified. This kind of
supervised ANN has also been used successfully in cry
classification [11].

Fig. 3 A feed-forward architecture

In order to increase the efficiency of the learning
process, the Method of Scaled Conjugate Gradient
(MSCG) is chosen [13]. The MSCG algorithm shows
superlinear convergence on most problems.


III. RESULTS


Starting from the primary information in BDLlanto, a
segmentation process was developed to generate the cry
units, yielding 73 healthy cry units and 68 pathological
cry units (related to hypoxia). For each class, 58 cry
units were chosen for training and 10 for classification.
The segmentation stage was semi-automatic, combining
begin/end detection (based on the energy function and the
zero-crossing rate) with manual correction to reduce the
negative effect of including inappropriate sections within
the cry unit. Figure 4 presents the scheme of the combined
classifier.


Fig. 4. Block diagram of the combined cry-based
classifier (blocks: BDLlanto, cry units, acoustic
parameter extraction (stridor, voicedness, ...),
threshold-based classifier giving the index FN1, MFCC
estimation, PCA, ANN-based classifier, final decision)

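
The automatic part of the begin/end detection mentioned in the segmentation stage can be sketched with short-time energy and zero-crossing rate. The frame length, both thresholds and the rule for combining the two measures are hypothetical; the paper does not give them, and the manual correction step is not modelled.

```python
def short_time_energy(frame):
    # Mean squared amplitude of the frame.
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose signs differ.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def active_frames(signal, frame_len, energy_thr, zcr_thr):
    # A frame counts as active when either measure exceeds its threshold.
    flags = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        flags.append(short_time_energy(frame) > energy_thr
                     or zero_crossing_rate(frame) > zcr_thr)
    return flags

def begin_end(flags):
    # Index of the first and last active frame, or None if nothing is active.
    active = [i for i, flag in enumerate(flags) if flag]
    return (active[0], active[-1]) if active else None

# Synthetic signal: silence, a burst of alternating samples, silence.
signal = [0.0] * 100 + [1.0, -1.0] * 50 + [0.0] * 100
flags = active_frames(signal, frame_len=50, energy_thr=0.1, zcr_thr=0.5)
limits = begin_end(flags)
```

The returned frame indices would then be converted to sample positions and reviewed by hand, as in the semi-automatic procedure described in the text.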


Starting from the cry units obtained from the database, a
parameter estimation is done for every cry unit, following
two possible paths:


(a) Estimation of 4 acoustic features for the
threshold-based classifier: each estimated feature is
compared with the normal threshold value associated with
the corresponding parameter, generating at the output an
index FN1 with the following gradation:
FN1: 0 for no parameter altered (normality index)
     0.25 for 1 parameter altered
     0.5 for 2 parameters altered
     0.75 for 3 parameters altered
     1.0 for 4 parameters altered
(b) Estimation of MFCCs for the ANN-based classifier:
500 MFCCs were computed for each generated cry unit
(because of the differences in duration among the cry
units, it was necessary to normalize and adjust the
vector of coefficients).
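
The FN1 gradation in item (a) amounts to the fraction of the four features found outside their normal range. The sketch below takes the outcome of the threshold comparisons as booleans, because the threshold values themselves are not given in the paper; the dictionary keys simply follow the four feature names.

```python
FEATURES = ("voicedness", "melody", "stridor", "shift")

def fn1_index(altered):
    # `altered` maps each feature name to True when it is outside its normal range.
    missing = [name for name in FEATURES if name not in altered]
    if missing:
        raise ValueError("missing features: " + ", ".join(missing))
    return sum(1 for name in FEATURES if altered[name]) / len(FEATURES)

normal_unit = {"voicedness": False, "melody": False, "stridor": False, "shift": False}
altered_unit = {"voicedness": True, "melody": True, "stridor": False, "shift": True}
fn1_normal = fn1_index(normal_unit)
fn1_altered = fn1_index(altered_unit)
```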
After the initial feature vector was passed through
principal component analysis (PCA), its dimension was
reduced to 50 principal components. The input vector was
then presented to the ANN, which had the following
structure: 50 nodes in the input layer, 15 nodes in the
hidden layer and 2 nodes in the output layer. To detect
the cry type of the newborn, the output values of the net
are analyzed. The output values are coded between 0 and
1. If the value of output node 1 is larger than the value
of output node 2, the sample is assigned to the class
"normal" (N), generating an FN2 index equal to 0;
otherwise it is assigned to the class "pathologic" (P),
generating an FN2 index equal to 1.
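
The 50-15-2 structure and the FN2 rule can be illustrated with a forward pass through a small feed-forward network. The weights below are random and untrained, the sigmoid activation is an assumption (the text only says the outputs lie between 0 and 1), and the scaled conjugate gradient training performed in the Matlab toolbox is not reproduced.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def random_layer(n_in, n_out, rng):
    # Untrained placeholder weights; a real classifier would learn these.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(inputs, layers):
    # Propagate the input vector through each layer in turn.
    activations = inputs
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

def fn2_index(outputs):
    # Output node 1 larger than node 2 -> "normal" (0), otherwise "pathologic" (1).
    return 0 if outputs[0] > outputs[1] else 1

rng = random.Random(0)
layers = [random_layer(50, 15, rng), random_layer(15, 2, rng)]
features = [rng.gauss(0.0, 1.0) for _ in range(50)]  # stand-in for 50 principal components
outputs = forward(features, layers)
fn2 = fn2_index(outputs)
```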

Finally, both the FN1 and FN2 indexes are processed in a
decision block, $D = \frac{FN1 + FN2}{2}$, resulting in a
two-class decision with 3 qualitative levels:
Normal: D <= 0.5
Moderately pathologic: D = 0.75
Pathologic: D = 1.0
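
The decision block can be sketched as below. The level boundaries follow the ranges used in the column headings of Table 1 (x <= 0.5, 0.5 < x <= 0.75, 0.75 < x); treating them as the operative rule is an assumption, since the text lists single values for the two pathologic levels.

```python
def combined_decision(fn1, fn2):
    # D is the mean of the threshold-based index FN1 and the ANN-based index FN2.
    d = (fn1 + fn2) / 2.0
    if d <= 0.5:
        label = "Normal"
    elif d <= 0.75:
        label = "Moderately pathologic"
    else:
        label = "Pathologic"
    return d, label

case_a = combined_decision(0.75, 0)  # three altered features, ANN says normal
case_b = combined_decision(0.5, 1)   # two altered features, ANN says pathologic
case_c = combined_decision(1.0, 1)   # all features altered, ANN says pathologic
```

The first case shows how a cry unit can fall in the Normal level even when FN1 alone is high, which is why both indexes are worth reporting alongside D.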


IV. DISCUSSION

The following table shows results from the combined
classifier.


Table 1. The output results from the combined classifier.


            Cry      Confusion matrix       D index (x)                         % of
            units    Normal    Pathology    x<=0.5    0.5<x<=0.75    0.75<x     classification
Normal      10       10        0            10        0              0          100
Pathology   10       2         8            2         7              1          80
Total       20                              12        7              1          90

The gradation in the D index lets physicians make proper
use of the output of the cry classifier, comparing it with
the results of the neurophysiological evaluation of the
newborn and weighing its possible meaning (how abnormal
the infant cry is from the acoustical point of view and its
"weight" for diagnostic purposes). The need to include
more acoustic features in cry classifiers to obtain better
classification rates, proposed and argued by Schonweiller
in 1996, is well demonstrated here. An interesting aspect
that deserves comment is that the only two cry units
misclassified as normal obtained an FN1 equal to 0.75
(significantly abnormal for the threshold-based
classifier), so the separate outputs of the two classifiers
also offer valuable information to be considered by the
specialists.

The software tools used in the experiment were: the
BDLlanto database with 12-second cry recordings of Cuban
children, the BPVOZ software package, and the PCVOX and
Praat software for acoustic signal processing. The ANN
implementation (including the MSGC algorithm) was done
with the Neural Network Toolbox from Matlab v. 6.0
[14-15].
V. CONCLUSION


The articulated processing of cry signals was
implemented in order to improve the effectiveness of
N/P cry classification, with satisfactory results. Both
output indexes, FN1 and FN2, also offer valuable
information for specialists, whether analyzed together or
separately. The cry unit as a basic element for signal
processing also performed well in this study. This
articulated signal processing will be the keystone of a
new cry-based methodology for diagnosing newborns with
CNS disorders (based on hypoxia), to be issued by the
Group of Speech Processing.
Abstract: Recent works emphasize the importance of
acoustic cues of species-specificity in primate vocal
communication. The potential of vocal tract resonance
in generating these cues is examined by anatomically
based computational modeling of the vocal tract. True
lemurs (genus Eulemur), which occur in Madagascar,
show remarkable species diversity, which makes them
especially good model species for studying these
interspecific differences. The oral vocal tract of lemurs
is relatively flexible, but the nasal tract also plays a
crucial role in their communicative system. We
analyzed the distinctive formant characteristics
produced by the computational models in order to
investigate inter- and intra-specific variation in
vocal tract size and shape. Differences in
morphological features between lemur taxa influence
the structural characters of their vocalizations.

Keywords: Prosimians, morphology, Eulemur, vocal 
behaviour. 


I. INTRODUCTION 


The evolution of species-specific traits in
communication signals is the result of complex
interactions of neurocognitive and morphophysical
factors.

In modern studies, the source-filter model is a central
insight for the interpretation of mammal vocal
production. Its application to non-human animals has
stressed the importance of formants in animal vocal
communication [1][2].

Several studies have demonstrated that formant-like bands
in animal sounds are the product of vocal tract resonances
during sound propagation [3][4].

However, the communicative importance of formants
in non-humans is less clear. Little attention has yet
been dedicated to the role formants play in conveying
information. Researchers seem to agree on two kinds of
information conveyed: individual identity and body size.

The fact that formants are influenced by the length of
the vocal tract [5], and thus by body size [6][7], was
investigated in some recent studies, and it has been
demonstrated that birds and mammals can spontaneously
perceive formants [8]. Hauser and Fitch [9] also
suggested that communication via formants was present in
terrestrial vertebrates long before the origin of
humans. An open question is whether formants may play
a role in conveying information on species specificity.
Investigations of vocalizations in lemurs may have
special importance because, even though DNA sequence
analyses have yielded a broad consensus on the
phylogenetic relationships between Eulemur, Hapalemur,
Lemur and Varecia, further relations between taxa are
still controversial [10]. Although quantitative analyses
of the sounds of Eulemur species are scarce, it is known
that low-pitched sounds emitted by lemurs radiate from
the nostrils [11] and that they possess species-specific
acoustic features [12].

In this paper, we investigate the relevance of vocal 
tract morphology in determining differences in formant 
values and formant dispersion in the lemurs of 
Madagascar using vocal tract modelling, here applied to 
the study of resonances in the nasal airways. 


II. METHODS 


One specimen per species of Eulemur rubriventer,
Eulemur macaco and Eulemur fulvus, belonging to the
collection of dead animals of the Dept. Faune, Parc
Botanique et Zoologique de Tsimbazaza (Antananarivo,
Madagascar), was partially thawed and tracheotomized.
Silicone rubber was injected through the tracheal tube,
passing through the larynx, until the oral and nasal
cavities were completely filled; the tube was then
clamped. All length and dimension measurements of the
casts were taken with a Mitutoyo digital caliper
(accurate to 0.01 mm). Measurements of the
cross-sectional axes of the vocal tract were then taken
on the casts (the cross-section was generally not
circular) at increments of 10 mm. Cross-sectional areas
were calculated from these measurements in Microsoft
Excel and used to build the vocal tract area function
that is the input of the MATLAB-based vocal tract
modeling software [13]. Models of oral and nasal tract
resonance in lemurs have successfully used concatenated
tubes of varying cross-sectional areas [14].
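As an illustration of how an area function can be built from axis measurements, the sketch below treats each cross-section as an ellipse. This is an assumption: the text states only that the sections were generally not circular and that the areas were computed in Excel, not which formula was used. The axis values and function names are hypothetical.

```python
import math


def elliptical_area_mm2(axis_a_mm, axis_b_mm):
    """Area of an ellipse whose full axes are axis_a_mm and axis_b_mm."""
    return math.pi * axis_a_mm * axis_b_mm / 4.0


def area_function(axis_pairs_mm, step_mm=10.0):
    """Return (distance along the cast in mm, area in mm^2) for each section.

    Sections are assumed to be measured every step_mm, starting at 0 mm.
    """
    return [(i * step_mm, elliptical_area_mm2(a, b))
            for i, (a, b) in enumerate(axis_pairs_mm)]


# Hypothetical caliper readings (full axes, in mm), not data from the study.
example_axes = [(6.0, 4.0), (8.0, 5.0), (10.0, 10.0)]
example_area_function = area_function(example_axes)
```

For a circular section the formula reduces to the usual pi * d^2 / 4, which gives a quick sanity check on the measurements.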

Concatenated tube models of the nasal tract of each
taxon were computed and the acoustic response was
compared with formant measures taken from natural calls
of the same species. Assuming that the vocal tract
morphology of a single dead animal is representative of
its species, we also considered formants predicted by
tubes whose size and length were increased or decreased
by 10%. Given that length scales as the cube root of
mass, we estimated that this takes into account a body
size variation of approximately 30%, reasonably larger
than natural variation among adults. For each model, F1
and F2 were taken from the computed acoustic response.
Comparisons were made between F1 and F2 from the computed
transfer functions and real formants measured on natural
vocalizations of the same species. Captive lemurs were
recorded in several institutions across Europe and the
United States: Parco Natura Viva (Bussolengo-Vr, Italy),
Mulhouse Zoo (France), Rheine Der Naturzoo and Koln Zoo
(Germany), Apenheul (Apeldoorn, The Netherlands),
St. Louis Zoo (USA), Twycross Zoo, Drusillas Park
(Alfrinston), Blackbrook Zoo (Alton Towers), Colchester
Zoo, Linton Zoo and Banham Zoo (UK), Parc Botanique et
Zoologique de Tsimbazaza (Antananarivo, Madagascar).
All recorded vocalizations were spontaneously emitted;
we avoided the use of eliciting stimuli and playbacks.
A minimum of 3 vocalizations for each of 39 lemurs were
digitized and analyzed using Praat 4.6.01 [15].
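The link between the 10% change in tube dimensions and the stated body size variation of about 30% follows from the cube-root scaling mentioned above. A short check, assuming simple isometric scaling (mass proportional to length cubed); the function name is hypothetical:

```python
def mass_ratio_from_length_ratio(length_ratio):
    """Mass ratio implied by a length ratio when mass scales as length cubed."""
    return length_ratio ** 3


mass_ratio_larger = mass_ratio_from_length_ratio(1.10)   # tubes lengthened by 10%
mass_ratio_smaller = mass_ratio_from_length_ratio(0.90)  # tubes shortened by 10%
```

Under this assumption a 10% longer tract corresponds to roughly 33% more mass and a 10% shorter one to roughly 27% less, consistent with the approximate figure of 30% quoted in the text.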


III. RESULTS 


We used the nasal tract length measurements from the
3 species to calculate expected formant values based on a
simple tube model of the vocal tract [1][6][16]. The
predicted formant values for a nasal tract length of
8 cm (Fig. 1; consistent with Eulemur rubriventer and
Eulemur macaco) are 1094 Hz (F1) and 3281 Hz (F2). The
predicted formant values for a nasal tract length of 9 cm
(Fig. 1; resembling Eulemur fulvus) are 972 Hz (F1) and
2917 Hz (F2).
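The simple tube predictions can be reproduced with the standard quarter-wavelength formula for a uniform tube closed at one end and open at the other, F_n = (2n - 1) c / (4 L). The speed of sound of 350 m/s used below is an assumption; it is not stated in the text but reproduces the reported values after rounding.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed value (350 m/s), not given in the text


def quarter_wave_formants(length_cm, n_formants=2, c_cm_s=SPEED_OF_SOUND_CM_S):
    """Resonances (Hz) of a uniform tube closed at the glottis, open at the nostrils."""
    return [(2 * n - 1) * c_cm_s / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]


formants_8cm = quarter_wave_formants(8.0)
formants_9cm = quarter_wave_formants(9.0)
```

In this model the formants depend only on tube length, so two species with the same nasal tract length receive identical predictions; the concatenated tubes model is what introduces shape-dependent differences.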

Vocal tract area functions derived from the silicone
casts were used to generate computational models of the
nasal tracts of Eulemur rubriventer, Eulemur macaco and
Eulemur fulvus.

The computational model of the supraglottal vocal
system of each of the three species considered in this
paper comprises a filter consisting of 8 (Eulemur
rubriventer and Eulemur macaco) or 9 (Eulemur fulvus)
concatenated tubes. These tubes approximate the
anatomical components of the vocal tract, from the
glottal constriction, through the nasopharyngeal cavity,
to the nasal chambers and nostrils. As shown in previous
studies, non-human primates vocalize alternatively
through the oral or the nasal tract [7][14].

The acoustic response can be calculated from the
anatomically based concatenated tubes model, in which
fixed-length tubes change in size according to the
anatomical measurements; varying these parameters allows
their significance to be determined.
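A concatenated tubes calculation of this kind can be sketched with lossless acoustic chain matrices. This is not the MATLAB software of [13]: it is a simplified illustration that assumes plane-wave propagation, no wall or radiation losses, a closed glottal end and an ideal open end at the nostrils. The physical constants and the area values are assumptions, and all names are hypothetical.

```python
import math

SPEED_OF_SOUND_CM_S = 35000.0  # assumed
AIR_DENSITY_G_CM3 = 0.00114    # assumed


def chain_d_element(freq_hz, lengths_cm, areas_cm2):
    """Real D element of the lossless chain matrix from glottis to nostrils.

    The matrix is [[a, j*b], [j*c, d]]; only the real numbers a, b, c, d are stored.
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND_CM_S
    a, b, c, d = 1.0, 0.0, 0.0, 1.0
    for length, area in zip(lengths_cm, areas_cm2):
        z = AIR_DENSITY_G_CM3 * SPEED_OF_SOUND_CM_S / area  # characteristic impedance
        cs, sn = math.cos(k * length), math.sin(k * length)
        a, b, c, d = (a * cs - b * sn / z,
                      a * z * sn + b * cs,
                      c * cs + d * sn / z,
                      d * cs - c * z * sn)
    return d


def resonances(lengths_cm, areas_cm2, f_max_hz=5000.0, step_hz=1.0):
    """Frequencies where D crosses zero, i.e. poles of U_out / U_in = 1 / D."""
    found = []
    f_prev, d_prev = step_hz, chain_d_element(step_hz, lengths_cm, areas_cm2)
    for i in range(2, int(f_max_hz / step_hz) + 1):
        f_cur = i * step_hz
        d_cur = chain_d_element(f_cur, lengths_cm, areas_cm2)
        if d_prev * d_cur < 0.0:
            # Linear interpolation of the zero crossing between the two grid points.
            found.append(f_prev + step_hz * d_prev / (d_prev - d_cur))
        f_prev, d_prev = f_cur, d_cur
    return found


# Check against the uniform 8 cm tube, then a hypothetical non-uniform tract.
uniform_tube = resonances([1.0] * 8, [1.0] * 8)
shaped_tube = resonances([1.0] * 8, [0.3, 0.5, 1.2, 1.5, 1.0, 0.8, 0.6, 0.4])
```

With equal areas the sketch collapses to the quarter-wavelength result, while unequal areas shift the resonances even though the total length is unchanged.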

The acoustic response of the three nasal tract models
showed differences between the species (Fig. 1): 472 Hz
(F1) and 2276 Hz (F2) for Eulemur rubriventer; 1097 Hz
(F1) and 2420 Hz (F2) for Eulemur macaco; 1005 Hz
(F1) and 2263 Hz (F2) for Eulemur fulvus. Concatenated
tubes models in which the segments were increased or
decreased by 10% in length and area (in agreement with
observed body size variation) exhibited the first two
peaks of the transfer function at 447-532 Hz (F1) and
2074-2527 Hz (F2) for Eulemur rubriventer; 1010-1204
Hz (F1) and 2219-2664 Hz (F2) for Eulemur macaco; and
918-1105 Hz (F1) and 2063-2508 Hz (F2) for Eulemur
fulvus.

Comparisons were made between the computed transfer
functions and real formants for the same species. Average
individual values of F1 and F2, measured from natural
calls of 15 Eulemur rubriventer, 13 Eulemur macaco and
11 Eulemur fulvus specimens, were plotted together with
the acoustic output of the computational models (Fig. 1).
Average F1 and F2 were 702±176 Hz and 2576±89 Hz for
E. rubriventer, 1311±200 Hz and 2772±117 Hz for
E. macaco, and 1082±300 Hz and 2249±102 Hz for
E. fulvus.
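One simple way to read the comparison between model output and natural calls is to express each model prediction as a deviation from the measured mean, in units of the reported spread. The sketch below uses only the values given in the text and treats the second number of each pair as a standard deviation, which is an assumption because the statistic is not named. The names are hypothetical.

```python
# Model predictions (F1, F2) in Hz, as reported for the concatenated tubes models.
MODEL_HZ = {
    "E. rubriventer": (472.0, 2276.0),
    "E. macaco": (1097.0, 2420.0),
    "E. fulvus": (1005.0, 2263.0),
}

# Measured (mean, spread) pairs in Hz for F1 and F2 from natural calls.
MEASURED_HZ = {
    "E. rubriventer": ((702.0, 176.0), (2576.0, 89.0)),
    "E. macaco": ((1311.0, 200.0), (2772.0, 117.0)),
    "E. fulvus": ((1082.0, 300.0), (2249.0, 102.0)),
}


def deviation_in_spread_units(model_hz, mean_hz, spread_hz):
    """Signed distance of the model value from the measured mean, in spread units."""
    return (model_hz - mean_hz) / spread_hz


def compare(species):
    """Return the (F1, F2) deviations for one species."""
    return tuple(
        deviation_in_spread_units(model, mean, spread)
        for model, (mean, spread) in zip(MODEL_HZ[species], MEASURED_HZ[species])
    )


deviations = {name: compare(name) for name in MODEL_HZ}
```

Negative values indicate that the model underestimates the measured formant. This is a descriptive aid only and not a statistical test performed in the study.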


Fig. 1. Cumulative formant plot showing the distribution
of first (F1) and second (F2) formants: 8 cm and 9 cm
simple tube models; concatenated tubes model predictions
for Eulemur rubriventer, Eulemur macaco and Eulemur
fulvus; formants measured from natural calls of Eulemur
rubriventer, Eulemur macaco and Eulemur fulvus.


IV. DISCUSSION 


The results presented in this paper agree with
previous investigations of lemur vocalisations, which
documented that the resonance properties of the
supralaryngeal tract determine formants [14], and these
are useful for investigating differences between species.

The first formant predicted by the computational models
showed remarkable differences between Eulemur
rubriventer and Eulemur macaco/Eulemur fulvus.
Relatively minor differences were found between
Eulemur macaco and Eulemur fulvus. The second formants
of Eulemur rubriventer and Eulemur fulvus were very
similar, and Eulemur macaco exhibited slightly higher
values.


Computational models indicate that the vocal tract
morphology of E. fulvus and E. macaco varies
proportionally in the length and size of the concatenated
tubes, while E. rubriventer showed a different formant
pattern, reflecting remarkable differences in nasal
tract morphology.

Observing formant variation in natural calls, all
species tend to show greater variation in F1 than
predicted by the models. In agreement with the model
outputs, Eulemur rubriventer and Eulemur macaco/Eulemur
fulvus showed remarkable differences for F1. For F2, the
natural calls showed smaller variation than the models
predicted, and F2 values separated the three species
well.

A convincing explanation for the differences between
predicted and natural variation in the F1/F2 plot is that
not all morphological changes of the vocal tract are
strictly bound to body size variation. In particular, in
some non-human primate species body size and vocal tract
length show an allometric relationship, and this is well
described in those sounds that allow a uniform tube
model interpretation [6].

For vocalizations that radiate through the nostrils,
concatenated tubes models provided a more reliable
prediction of F1 and F2, and the previous assumption does
not imply that the areas of the concatenated tubes
vary proportionally with body size.

Unfortunately, a precise resolution of this issue was 
prevented by a lack of data documenting any 
disproportionate anatomical differences between males and 
females, or sub-adults and adults within a prosimian 
species [17]. 


V. CONCLUSION 


Grunt vocalizations from three species of Madagascar 
lemurs showed consistent species-specific characteristics. 

The results showed that the species-specific morphology
of the nasal tract in some lemur species effectively
determines formant frequencies in species-typical
vocalizations. The degree of difference between species,
based both on the results of the acoustic analysis and
on the acoustic response of the vocal tract models,
changes in relation to the species.

Differences in morphological features between lemur 
taxa have an influence on specific structural characters of 
their vocalizations. 
Abstract: This paper deals with the objective analysis
of the singing voice, and aims at giving non-professional
singers both an aid to improve their voice capabilities
and a criterion to prevent a wrong vocal attitude
(head, neck, and body posture) that could even cause
vocal pathologies.

A new standalone application with a user-friendly
interface is proposed for robust and reliable analysis
of singing voice characteristics. The tool performs
tracking of fundamental frequency and formants,
along with an objective measure of the main singing
voice parameters, such as vibrato rate (V_rate), vibrato
extent (V_ext), and vocal intonation (V_int). A
side-view camera allows the singer's posture to be
displayed and recorded.

Data are collected at the School of Music in Fiesole,
Firenze, Italy, under the supervision of a voice teacher
and a teacher of the Alexander Technique (AT). First
results are presented, comparing vocalizations from both
professional and non-professional singers under
different postures.

Keywords: objective voice parameters estimation, 
singing voice, Alexander Technique. 


I. INTRODUCTION 


At present, singing is learned basically by means of the
perception and psycho-physical control of the singer
during his/her performance. Also, it is mainly up to the
singing teacher to perceptually evaluate the quality of a
performance. This makes it difficult to define standard
procedures and reference values, also because few
objective means to evaluate a singer's capability and
improvements are currently available [1-4].

The singing voice results from the complex activity of
the larynx and of the vocal tract articulators, and is
characterized by possibly high-pitched, rapidly
time-varying signals. In this preliminary study, the
basic features of the singing voice are considered, i.e.
the fundamental frequency F0 (linked to vocal fold
oscillation), along with its modulation in time and
frequency, and the formants Fi (resonance frequencies of
the vocal tract), along with their energy. The time
evolution, standard deviation (std) and maxima of such
parameters over the whole vocalization are of importance
for singers, being strictly related to correct vocal
emission and hence to the singer's performance.
Moreover, V_rate, V_ext and V_int are of importance in
giving the singer useful information on the professional
level achieved, possibly as compared to professional
singers [4-6].

The AT postural technique has gained interest among
singing teachers for its possible advantages as a
complement to vocal training. The AT is a method of
re-education based on creating a dynamic, balanced
relationship between head, neck and back and, as a
result, the whole body. Recent studies have shown that,
after a few months of AT application, the singing voice
becomes more resonant and singing easier to perform
[7-8]. Lessons are entirely practical. The teacher gives
students helpful suggestions, also by means of a very
skilful and subtle use of the hands. The student is taken
through simple movements, like standing up, sitting down
or walking, to understand the principles on which the
dynamics of the whole body is based. The AT is taught in
forty countries around the world. It has also been
studied and taught for many years at the School of Music
in Fiesole, Italy.


II. METHODS


Singing voice signals are analysed by means of a
multi-purpose, user-friendly tool, based on robust
analysis techniques capable of dealing also with
high-pitched, quasi-stationary signals, which are among
those under study.

To track fast signal variations, the signal is divided
into short frames whose length varies adaptively
according to the signal characteristics: the higher the
F0, the shorter the frame (kept fixed at 3 pitch
periods). A voiced/unvoiced separation algorithm is
implemented to avoid parameter estimation on signal
frames that have no harmonic content.
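The adaptive frame rule can be illustrated with a short sketch: a frame spanning 3 pitch periods contains 3 * Fs / F0 samples. The function name and the rounding to the nearest sample are assumptions; the tool's actual implementation is not described in the text.

```python
def frame_length_samples(fs_hz, f0_hz, periods=3):
    """Number of samples in a frame spanning a fixed number of pitch periods."""
    if fs_hz <= 0 or f0_hz <= 0:
        raise ValueError("sampling frequency and F0 must be positive")
    return max(1, round(periods * fs_hz / f0_hz))


# Example at the recording rate used in this study (44.1 kHz).
frame_low_pitch = frame_length_samples(44100.0, 220.5)   # lower F0, longer frame
frame_high_pitch = frame_length_samples(44100.0, 441.0)  # higher F0, shorter frame
```

Doubling F0 halves the frame length, which is how the analysis keeps its time resolution on high-pitched signals.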

F0 tracking is achieved by means of a robust two-step
procedure, based on well-established results [9].
High-resolution formant estimation is implemented, based
on parametric AutoRegressive (AR) PSD evaluation. The AR
model order p is automatically selected by the program
according to subject and signal characteristics, based on
a simple relationship between p, Fs (sampling frequency),
L (vocal tract length, linked to age and sex), and c
(sound speed) [10]. Colour-coded spectrograms are
provided, with formant tracking (F1-F5 for singers)
superimposed. Mean values and std are also shown. PSD
plots complete the set of pictures, allowing detailed
inspection of harmonic energy characteristics.
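Parametric AR PSD estimation with peak picking can be sketched as follows. This is a generic textbook version (autocorrelation method with the Levinson-Durbin recursion), not the tool's implementation: the model order is passed in by hand because the order-selection rule of [10] is not reproduced here, and the test signal is a synthetic single resonance. All names are hypothetical.

```python
import cmath
import math
import random


def autocorrelation(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of a real sequence."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return the AR polynomial [1, a1, ..., ap] and the residual power."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        reflection = -acc / err
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + reflection * prev[m - k]
        a[m] = reflection
        err *= 1.0 - reflection * reflection
    return a, err


def ar_psd_peaks(x, order, fs_hz, n_points=512):
    """Frequencies (Hz) of the local maxima of the AR power spectral density."""
    a, err = levinson_durbin(autocorrelation(x, order), order)
    freqs, psd = [], []
    for i in range(n_points + 1):
        f = i * fs_hz / (2.0 * n_points)
        w = 2.0 * math.pi * f / fs_hz
        denom = abs(sum(a_k * cmath.exp(-1j * w * k) for k, a_k in enumerate(a))) ** 2
        freqs.append(f)
        psd.append(err / denom)
    return [freqs[i] for i in range(1, n_points)
            if psd[i - 1] < psd[i] >= psd[i + 1]]


def synthetic_resonance(fs_hz, f_res_hz, radius=0.97, n=4000, seed=0):
    """White noise filtered by one resonance (pole pair) at f_res_hz."""
    rng = random.Random(seed)
    theta = 2.0 * math.pi * f_res_hz / fs_hz
    c1, c2 = 2.0 * radius * math.cos(theta), -radius * radius
    x = [0.0, 0.0]
    for _ in range(n):
        x.append(c1 * x[-1] + c2 * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]


estimated_peaks = ar_psd_peaks(synthetic_resonance(8000.0, 1000.0), order=2, fs_hz=8000.0)
```

On real singing voice frames a much higher order would be needed to resolve several formants, which is why the order is tied to sampling frequency and vocal tract length in the tool.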

A user-friendly interface (Fig. 1) allows selecting age,
sex and type of vocal emission for each subject, and
performing computations without any other requirement.
The software tool automatically adjusts internal settings
for optimal frame length, frequency range of analysis and
plots. Specifically, the interface allows for:

- selecting data (.wav files);

- choosing the voice type (new-born, singers, adults). The
overall allowed F0 range is 40 Hz < F0 < 1300 Hz;

- selecting the kind of analysis: single audio file or two
files, for comparison purposes.

A moving bar shows the residual time during
computations. For long files (>5 s) and high sampling
frequencies (>40 kHz) the total time could approach
5 min.

A number of plots are displayed and saved in printable
format, for a visual comparison of results. Specifically,
for the singing voice, F0, V_rate, V_ext, V_int,
spectrogram, formants and PSD are plotted, all in colour.

The software tool is developed under Matlab® R2006b.
Figure 1. The user interface for voice analysis


III. RESULTS


First results were obtained for one professional
mezzo-soprano and 3 students (2 tenors, in the 1st and
2nd year of the course respectively, and 1 soprano in the
first year). Data were recorded with a professional
directional microphone (SHURE SM58) connected to a
TASCAM US144 A/D board, and stored on a notebook. The
sampling rate was Fs = 44.1 kHz, with 16-bit resolution.

After proper warming up, singers performed vocalizations
based on the Italian sustained /a/ vowel at different F0
values, under the supervision of both a singing teacher
and an AT teacher. The /a/ vowel was chosen for its
reasonable independence from other factors, mainly the
tongue position. The students, who had no experience
with the AT, performed vocalizations with a "natural"
posture first (i.e. without conscious control of their
body), and then with the posture suggested by the AT
teacher. The professional mezzo-soprano, who has been
undertaking personal AT training for some years, instead
performed "natural" /a/ vocalizations by intentionally
keeping a non-balanced relationship between head, neck
and back. Then, she applied the proper AT posture.

To compare results, some plots and parameters are
reported here for two cases: the professional
mezzo-soprano and one non-professional soprano, both
emitting a sustained /a/ with vibrato. Figs. 2 and 3
show, from top to bottom: signal amplitude, F0, V_int,
and the PSD. On the left: without AT; on the right: with
AT. V_rate and V_ext are reported below for both cases.
According to the literature [3-5], a good-quality vibrato
should lie approximately in the range 5 < V_rate < 7.5
cycles/s and V_ext < 2 semitones (+/- 1 semitone
corresponds to a frequency swing of approximately
+/- 6%).
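The semitone figures can be checked with the equal-tempered relation, in which one semitone is a frequency ratio of 2^(1/12), about 1.0595, i.e. roughly 6%. The sketch below converts a +/- 1 semitone swing around a mean F0 into Hz. Treating V_ext as a total (peak-to-peak) extent is an assumption, and the function names are hypothetical.

```python
import math

SEMITONE_RATIO = 2.0 ** (1.0 / 12.0)  # equal-tempered semitone, about 1.0595


def semitones_between(f_low_hz, f_high_hz):
    """Interval between two frequencies in equal-tempered semitones."""
    return 12.0 * math.log2(f_high_hz / f_low_hz)


def total_swing_hz(f0_mean_hz, semitones_each_side=1.0):
    """Width in Hz of a swing of +/- semitones_each_side around f0_mean_hz."""
    ratio = 2.0 ** (semitones_each_side / 12.0)
    return f0_mean_hz * (ratio - 1.0 / ratio)


# Mean F0 values reported for the two singers in this study.
swing_mezzo_hz = total_swing_hz(230.39)
swing_soprano_hz = total_swing_hz(439.7)
```

The exact relation gives slightly smaller widths than the 6% approximation used for the reference values quoted with the results (about 28 Hz and 53 Hz).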

For the professional mezzo-soprano, without applying
the AT, the following parameters were obtained:
F0mean = 230.39 Hz, V_rate = 4.9 cycles/s (std = 0.39
cycles/s), V_ext = 10.6 Hz (std = 3.8 Hz; 2 semitones
correspond to ~28 Hz). After training with AT, the
parameters were: F0mean = 231.87 Hz, V_rate = 5.0
cycles/s (std = 0.3 cycles/s), V_ext = 25 Hz (std = 3.9
Hz). Fig. 2 shows some results obtained with the
proposed tool. Notice the more regular vibrato and the
higher energy of the 3rd-5th formants with AT. V_int is
also remarkably more stable. Perceptual evaluation
confirms the better quality of the AT vocalization; the
AT seems to enhance the singer's performance in this
case.

Fig. 3 shows the results obtained for the
non-professional soprano. Without applying the AT, we
found: F0mean = 439.7 Hz, V_rate = 5.2 cycles/s
(std = 0.6 cycles/s), V_ext = 23.3 Hz (std = 4.6 Hz;
2 semitones correspond to ~53 Hz). After training with
AT: F0mean = 439.7 Hz, V_rate = 4.9 cycles/s (std = 0.4
cycles/s), V_ext = 22.5 Hz (std = 3.8 Hz).

Notice that the vibrato values are quite similar in both
cases. However, different vocal strategies were applied,
with different formant frequencies and energy, especially
above 3 kHz, as shown in the PSD plot. With AT, better
perceptual results were obtained.

As for the tenors, the analysis showed no remarkable
improvement in voice quality with AT, in agreement with
perceptual evaluation. This may be due to the following
factors.

Professional singers make use of precise control of both
laryngeal and vocal tract functions, with several
continuous adjustments that make up the basic tools for
good singing. On the contrary, the students had not yet
developed good auditory and self-receptive feedback, nor
reliable muscular training. This often causes a
modified, para-physiological vocal emission that involves
the whole neck and shoulders, and hence an altered global
posture. The AT destabilizes the attitudes that are
deep-rooted in students' postural habits and requires
great psycho-physical concentration. This could even have
destabilized the knowledge and compensatory skills used
by the students in vocal emission before the AT teacher's
suggestions. Finally, the students involved in this
experiment, being non-professionals, were not accustomed
to singing before an audience. The presence of the
microphone and camera could have further influenced
their performance.

Notice that, from a perceptual point of view, the AT
softened the onset of the vocalization in most cases.
Though very limited in number and in time, these first
results suggest that non-professionals who add AT to
their vocal training could find it easier to access the
main functions related to proper sound emission, and
could be helped in overcoming limitations such as limited
vocal range, voice breaks, improper use of vocal
registers, and often vocal fatigue. As functions are
carried out on posture, a good postural balance allows
for the most economical use of a function, and makes it
possible to perform even subtle adjustments.

Figure 2. Professional mezzo-soprano. Left: without AT;
Right: with AT. Panels, top to bottom: signal amplitude
of the sustained /a/ with vibrato versus time [s]; F0
[Hz] (mean F0: 231.87 Hz in the panel shown); vocal
intonation; mean PSD [dB] versus frequency [Hz] (dashed:
FFT; solid: AR(44)).

The features of the professional singer were found to
agree with those proposed in the literature. If a larger
data set becomes available, a reference set for
non-professionals could be set up.


IV. FINAL REMARKS 


A user-friendly, robust tool for voice analysis has been
presented. It allows the analysis of voice recordings
over a wide range of F0 values, which makes it a
multi-purpose tool. At present, the new tool works
off-line. If properly implemented, it would allow for
real-time analysis of voice signals.

As for the singing voice, preliminary results show
different F0 and formant strategies, related to singing
technique and/or posture. Hence, the tool could help give
non-professional singers and singing teachers reliable
objective measures of possible improvements during and
after training with any teaching technique.
Figure 3. Non-professional soprano. Left: without AT;
Right: with AT


The collection and analysis of several audio/video files
(not reported here due to space limitations) is ongoing,
in order to give more reliable results.

Finally, further studies are needed to investigate in
detail the influence of proper posture on singing voice
production.
"Mozart's Voice"


A cultural appetizer 


Philippe DeJonckere 
University of Utrecht 
Utrecht, The Netherlands 


Mozart is mainly known to us as a genius of musical composition, and as
a virtuoso keyboard player as well as violinist/violist. He composed major
works for voice, but what do we know about his own (singing) voice?


When searching carefully in contemporary documents, such as letters,
posters, diaries and written testimonies, we can learn a lot about what and
how Mozart himself sang: from his first public performance as a 5-year-old
choirboy in Sigismundus Rex Hungariae to the rehearsal of the
unfinished Requiem at his home a few hours before his death, where he
himself sang the alto part and was forced to stop after the first bars of
the "Lacrymosa".


This vocal pilgrimage, with 180 slides and several musical illustrations,
provides a fascinating look at Mozart's life and times from a double
viewpoint: the musicological one and that of the voice scientist.